ropengov / eurostat

R tools for Eurostat data

Home Page: http://ropengov.github.io/eurostat

License: Other

R 99.89% Shell 0.11%

eurostat eurostat-data r ropengov r-package cran

eurostat's Introduction

eurostat R package

R tools to access open data from Eurostat. Data search, download, manipulation and visualization.

Installation and use

Install the stable version from CRAN:

install.packages("eurostat")

Alternatively, install the development version from GitHub:

# Install from GitHub
library(devtools)
devtools::install_github("ropengov/eurostat")

The development version can also be installed from r-universe:

# Enable this universe
options(repos = c(
  ropengov = "https://ropengov.r-universe.dev",
  CRAN = "https://cloud.r-project.org"
))

install.packages("eurostat")

The package provides several different ways to get datasets from Eurostat. Searching for data is one way, if you know what to look for.

# Load the package
library(eurostat)

# Perform a simple search and print a table
passengers <- search_eurostat("passenger transport")
knitr::kable(head(passengers))

title	code	type	last.update.of.data	last.table.structure.change	data.start	data.end	values	hierarchy
Air passenger transport	enps_avia_pa	dataset	13.03.2023	13.03.2023	2005	2021	406	6
Modal split of air, sea and inland passenger transport	tran_hv_ms_psmod	dataset	29.06.2023	29.06.2023	2008	2021	2100	4
Modal split of inland passenger transport	tran_hv_psmod	dataset	29.06.2023	29.06.2023	1990	2021	4219	4
Volume of passenger transport relative to GDP	tran_hv_pstra	dataset	11.08.2023	29.06.2023	1990	2021	969	4
Maritime passenger transport performed in the Exclusive Economic Zone (EEZ) of the countries	mar_tp_pa	dataset	21.02.2023	21.02.2023	2005	2021	1752	4
Air passenger transport by reporting country	avia_paoc	dataset	04.12.2023	28.11.2023	1993	2023-Q3	2482969	5

See the Tutorial and other resources at the package homepage for more information and examples.

Recommended packages

It is recommended to install the giscoR package (https://dieghernan.github.io/giscoR/). This is another API package that provides R tools for Eurostat geographic data to support geospatial analysis and visualization.

Contribute

Contributions are very welcome:

Use the issue tracker for feedback and bug reports.
Send pull requests.
Star us on the GitHub page.
Join the discussion on Gitter.

Acknowledgements

Kindly cite this package by citing the following R Journal article:

Lahti L., Huovari J., Kainu M., and Biecek P. (2017). Retrieval and analysis of Eurostat open data with the eurostat package. The R Journal 9(1), pp. 385-392. doi: 10.32614/RJ-2017-019.

In addition, please provide a citation to the specific software version used:

Lahti, L., Huovari J., Kainu M., Biecek P., Hernangomez D., Antal D., and Kantanen P. (2023). eurostat: Tools for Eurostat Open Data [Computer software]. R package version 4.0.0.9003. https://github.com/rOpenGov/eurostat

We are grateful to all contributors, including Daniel Antal, Joona Lehtomäki, Francois Briatte, and Oliver Reiter, and for the Eurostat open data portal! This project is part of rOpenGov.

Disclaimer

This package is in no way officially related to or endorsed by Eurostat.

When using data retrieved from Eurostat database in your work, please indicate that the data source is Eurostat. If your re-use involves some kind of modification to data or text, please state this clearly to the end user. See Eurostat policy on copyright and free re-use of data for more detailed information and certain exceptions.

eurostat's Issues

Cache file should record time_format argument on the name

Currently, changing time_format after a dataset has been cached has no effect, because the cached file is reused regardless of the requested format.
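A minimal sketch of the idea, assuming the cache is a set of .rds files named after the dataset. The helper name cache_file_name() and the exact naming scheme are hypothetical and are not taken from the package source:

```r
# Hypothetical helper: encode every argument that changes the stored result
# in the cache file name, so each combination gets its own file.
cache_file_name <- function(id, time_format = "date", type = "code") {
  paste0(id, "_", time_format, "_", type, ".rds")
}

# Different time formats now map to different cache files
cache_file_name("tsdtr210", time_format = "num")
cache_file_name("tsdtr210", time_format = "date")
```

With names built this way, a request with a different time_format misses the cache and triggers a fresh download instead of silently returning the old format.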

URLS changed

Eurostat changed the structure and URLs of their site. As a result, some of the functions in eurostat, e.g. getEurostatRaw, no longer work. I think that in getEurostatRaw, changing the bulk download URL from

adres <- paste("http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&file=data%2F",
kod, ".tsv.gz", sep = "")

to

adres <- paste("http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2F",
kod, ".tsv.gz", sep = "")

should do the trick.

Thank you!

dw006

`several time frequencies` Error

I've encountered the following error for the cens_01rdhh eurostat table

> cens_01rdhh <- get_eurostat(id = 'cens_01rdhh')
trying URL 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fcens_01rdhh.tsv.gz'
Content type 'application/octet-stream;charset=UTF-8' length 68495 bytes (66 KB)
==================================================
downloaded 66 KB

Error in tidy_eurostat(y_raw, time_format, select_time, stringsAsFactors = stringsAsFactors,  : 
  Data includes several time frequencies. Select frequency with
         select_time or use time_format = "raw".
         Available frequencies: '''1''2''3''4''5''6''0''7''8''9''A''B''C''D''E''F'

EDIT: get_eurostat_json works fine.

My session info:

> devtools::session_info()
Session info -------------------------------------------------------------
 setting  value                       
 version  R version 3.2.2 (2015-08-14)
 system   x86_64, darwin14.5.0        
 ui       RStudio (0.99.878)          
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       Europe/Warsaw               
 date     2016-04-16                  

Packages -----------------------------------------------------------------
 package    * version     date       source                            
 assertthat   0.1         2013-12-06 CRAN (R 3.2.0)                    
 colorspace   1.2-6       2015-03-11 CRAN (R 3.2.0)                    
 curl         0.9.7       2016-04-10 CRAN (R 3.2.2)                    
 DBI          0.3.1       2014-09-24 CRAN (R 3.2.0)                    
 devtools     1.11.0      2016-04-12 CRAN (R 3.2.2)                    
 digest       0.6.9       2016-01-08 CRAN (R 3.2.2)                    
 dplyr      * 0.4.3       2015-09-01 CRAN (R 3.2.0)                    
 eurostat   * 1.2.21.9002 2016-04-16 Github (rOpenGov/eurostat@59205c3)
 ggplot2    * 2.1.0       2016-03-01 CRAN (R 3.2.2)                    
 ggthemes   * 3.0.3       2016-04-09 CRAN (R 3.2.2)                    
 git2r        0.14.0      2016-03-13 CRAN (R 3.2.2)                    
 gtable       0.2.0       2016-02-26 CRAN (R 3.2.2)                    
 htmltools    0.3.5       2016-03-21 CRAN (R 3.2.2)                    
 httr         1.1.0       2016-02-02 Github (hadley/httr@a68c86c)      
 knitr        1.12.3      2016-01-22 CRAN (R 3.2.2)                    
 labeling     0.3         2014-08-23 CRAN (R 3.2.0)                    
 lazyeval     0.1.10      2015-01-02 CRAN (R 3.2.0)                    
 lubridate  * 1.5.6       2016-04-06 CRAN (R 3.2.4)                    
 magrittr     1.5         2014-11-22 CRAN (R 3.2.0)                    
 memoise      1.0.0       2016-01-29 CRAN (R 3.2.2)                    
 munsell      0.4.3       2016-02-13 CRAN (R 3.2.2)                    
 plyr         1.8.3       2015-06-12 CRAN (R 3.2.0)                    
 R6           2.1.2       2016-01-26 CRAN (R 3.2.3)                    
 Rcpp         0.12.4      2016-03-26 CRAN (R 3.2.4)                    
 rmarkdown    0.9.5       2016-02-22 CRAN (R 3.2.2)                    
 scales       0.4.0       2016-02-26 CRAN (R 3.2.2)                    
 stringi      1.0-1       2015-10-22 CRAN (R 3.2.2)                    
 stringr      1.0.0       2015-04-30 CRAN (R 3.2.0)                    
 tidyr      * 0.4.1       2016-02-05 CRAN (R 3.2.2)                    
 withr        1.0.1       2016-02-04 CRAN (R 3.2.2)                    
 yaml         2.1.13      2014-06-12 CRAN (R 3.2.0)

Way to clear a cache

A function to clear the cache. It needs a directory for the cache under the temporary directory.

Implemented in 3915056 as clean_eurostat_cache().
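For illustration, a cache-clearing function along these lines could look as follows. This is a sketch, not the code from the commit above; the function name clear_cache_sketch() and the default directory are assumptions:

```r
# Hypothetical sketch: remove all cached .rds files from a cache directory.
# The default location (a "eurostat" folder under tempdir()) is an assumption.
clear_cache_sketch <- function(cache_dir = file.path(tempdir(), "eurostat")) {
  if (!dir.exists(cache_dir)) {
    return(invisible(character(0)))
  }
  files <- list.files(cache_dir, pattern = "\\.rds$", full.names = TRUE)
  unlink(files)
  # Return the removed paths invisibly so callers can report them if they wish
  invisible(files)
}
```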

contact email for eurostat

At the moment we have the louhos list ([email protected]) as the contact address for the eurostat package. The ropengov list ([email protected]) would be better, as this is an international package (louhos is used for Finnish packages that have little use in an international context and can use the Finnish language). I would expect a wider international community of developers on the ropengov list. Note that both of these lists are public.

Reorder factors based on dictionary

Some variables, like classifications, have a logical or at least conventional ordering. Eurostat data tables do not seem to have this ordering, but the dictionaries do. label_eurostat() could be modified, perhaps optionally, to reorder the factor levels in the data to follow the order in the dictionaries.
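The reordering could be sketched in base R as below. The function name reorder_by_dictionary() and the example codes are made up for illustration; the only assumption is that the dictionary lists the codes in the desired order:

```r
# Hypothetical sketch: order factor levels as they appear in a dictionary.
# Codes missing from the dictionary are appended at the end.
reorder_by_dictionary <- function(x, dic_codes) {
  x <- as.character(x)
  known <- dic_codes[dic_codes %in% x]
  extra <- setdiff(unique(x), dic_codes)
  factor(x, levels = c(known, extra))
}

# Illustrative values only
age <- c("Y65-74", "Y15-24", "TOTAL", "Y25-64")
dic <- c("TOTAL", "Y15-24", "Y25-64", "Y65-74")
levels(reorder_by_dictionary(age, dic))
```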

The bulk download URL will change

See http://epp.eurostat.ec.europa.eu/portal/page/portal/help/new_eurostat_website
The launch of the new Eurostat website is planned for 12-14 December.

The bulk download URL will change from
http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing
to
http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing

replace plyr with dplyr

This will be useful for many compatibility issues as plyr and dplyr are poorly compatible, and dplyr is now rapidly becoming more popular.

Implement cache

Enable caching for large data sets, for instance following the examples outlined in pxweb.
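One possible shape for such a cache is a read-through wrapper around the download step. The sketch below is not the pxweb approach or the package's implementation; cached_get() and its fetch argument are hypothetical, and update_cache simply forces a refresh:

```r
# Hypothetical read-through cache: return the stored copy when present,
# otherwise call fetch(id), store the result as .rds and return it.
cached_get <- function(id, fetch,
                       cache_dir = file.path(tempdir(), "eurostat"),
                       update_cache = FALSE) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  cache_file <- file.path(cache_dir, paste0(id, ".rds"))
  if (!update_cache && file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  dat <- fetch(id)
  saveRDS(dat, cache_file)
  dat
}
```

Here fetch stands in for whatever function actually downloads and tidies the table, so the caching logic stays independent of the download code.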

Errors after `eurostat` package updates

I have updated the eurostat package to version 1.2.13 on R 3.2.2. As a result, I am not able to read the tables I have selected into R. Please find below the two types of error I got, together with my session info.

> eurostat::get_eurostat('isoc_cimobi_dev')
trying URL 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fisoc_cimobi_dev.tsv.gz'
Content type 'application/octet-stream;charset=UTF-8' length 471305 bytes (460 KB)
==================================================
downloaded 460 KB

Error in if (tcode != "_" && nchar(times[1]) > 7) { : 
  missing value where TRUE/FALSE needed
> eurostat::get_eurostat('isoc_cias_mph')
trying URL 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fisoc_cias_mph.tsv.gz'
Content type 'application/octet-stream;charset=UTF-8' length 46481 bytes (45 KB)
==================================================
downloaded 45 KB

Error in tidy_eurostat(y_raw, time_format, select_time, stringsAsFactors = stringsAsFactors,  : 
  Data includes several time frequencies. Select frequency with
         select_time or use time_format = "raw".
         Available frequencies: 'Y'''

> devtools::session_info()
Session info ------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.2 (2015-08-14)
 system   x86_64, darwin14.5.0        
 ui       RStudio (0.99.862)          
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       Europe/Warsaw               
 date     2016-01-22                  

Packages ----------------------------------------------------------------------------------
 package      * version date       source                               
 assertthat     0.1     2013-12-06 CRAN (R 3.2.0)                       
 curl           0.9.4   2015-11-20 CRAN (R 3.2.2)                       
 DBI            0.3.1   2014-09-24 CRAN (R 3.2.0)                       
 devtools       1.9.1   2015-09-11 CRAN (R 3.2.0)                       
 digest         0.6.9   2016-01-08 CRAN (R 3.2.2)                       
 dplyr        * 0.4.3   2015-09-01 CRAN (R 3.2.0)                       
 eurostat     * 1.2.13  2016-01-19 CRAN (R 3.2.2)                       
 httr           1.0.0   2015-06-25 CRAN (R 3.2.0)                       
 knitr          1.12    2016-01-07 CRAN (R 3.2.2)                       
 lazyeval       0.1.10  2015-01-02 CRAN (R 3.2.0)                       
 magrittr       1.5     2014-11-22 CRAN (R 3.2.0)                       
 memoise        0.2.1   2014-04-22 CRAN (R 3.2.0)                       
 PISA2012lite * 1.0     2016-01-22 Github (pbiecek/PISA2012lite@1616df2)
 R6             2.1.1   2015-08-19 CRAN (R 3.2.0)                       
 Rcpp           0.12.3  2016-01-10 CRAN (R 3.2.2)                       
 Rpoppler     * 0.0-1   2015-07-03 CRAN (R 3.2.2)                       
 SDaA         * 0.1-3   2014-09-04 CRAN (R 3.2.0)                       
 stringi        1.0-1   2015-10-22 CRAN (R 3.2.2)                       
 stringr        1.0.0   2015-04-30 CRAN (R 3.2.0)                       
 tidyr          0.4.0   2016-01-18 CRAN (R 3.2.2)

get_eurostat_dic code type error

Got the following by email from WD:

I am playing with your good eurostat library and I discovered a problem getting the dictionary, e.g. try to get the “ind_farm” dictionary.

You will get:
1 Total number of holdings

But this is wrong; it should be:
001 Total number of holdings

The problem lies in the fact that the first column is treated as a numeric column and not a character column, so in the read.table statement you should put a colClasses statement:

get_eurostat_dic <- function(dictname, lang = "en") {
  # Read both columns as character so that codes such as "001" keep their leading zeros
  read.table(
    paste0("http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=dic%2F",
           lang, "%2F", dictname, ".dic"),
    sep = "\t", header = FALSE, colClasses = c("character", "character"),
    stringsAsFactors = FALSE, quote = "\"", fileEncoding = "Windows-1252"
  )
}
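The effect of colClasses can be reproduced offline with a small in-memory example. The second dictionary row below is invented for illustration:

```r
# Two tab-separated dictionary rows; the second one is made up
dic_text <- "001\tTotal number of holdings\n002\tAnother label"

# Default type guessing turns the code column into integers
default_read <- read.table(text = dic_text, sep = "\t", header = FALSE,
                           stringsAsFactors = FALSE)

# Forcing character columns keeps the codes exactly as written
char_read <- read.table(text = dic_text, sep = "\t", header = FALSE,
                        colClasses = c("character", "character"),
                        stringsAsFactors = FALSE)

default_read$V1
char_read$V1
```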

get_eurostat() does not handle daily data correctly

It seems that the workhorse function eurostat::get_eurostat() does not handle daily data correctly. It can convert dates only down to monthly frequency.

library(eurostat)

# Euro/ECU exchange rates - daily data
id <- "ert_bil_eur_d"

dat <- 
  get_eurostat(id, time_format = "raw")

dat_date <-
  get_eurostat(id, time_format = "date", update_cache = TRUE)

dat: time_format = "raw"

tbl_df(dat)
Source: local data frame [379,584 x 5]

   statinfo unit currency       time  values
1       AVG  NAC      ARS 2015M07D03 10.1028
2       AVG  NAC      AUD 2015M07D03  1.4747
3       AVG  NAC      BGN 2015M07D03  1.9558
4       AVG  NAC      BRL 2015M07D03  3.4584
5       AVG  NAC      CAD 2015M07D03  1.3961
6       AVG  NAC      CHF 2015M07D03  1.0466
7       AVG  NAC      CNY 2015M07D03  6.8856
8       AVG  NAC      CZK 2015M07D03 27.1450
9       AVG  NAC      DKK 2015M07D03  7.4607
10      AVG  NAC      GBP 2015M07D03  0.7102

dat_date: time_format = "date"

tbl_df(dat_date)
Source: local data frame [379,584 x 5]

   statinfo unit currency       time  values
1       AVG  NAC      ARS 2015-07-01 10.1028
2       AVG  NAC      AUD 2015-07-01  1.4747
3       AVG  NAC      BGN 2015-07-01  1.9558
4       AVG  NAC      BRL 2015-07-01  3.4584
5       AVG  NAC      CAD 2015-07-01  1.3961
6       AVG  NAC      CHF 2015-07-01  1.0466
7       AVG  NAC      CNY 2015-07-01  6.8856
8       AVG  NAC      CZK 2015-07-01 27.1450
9       AVG  NAC      DKK 2015-07-01  7.4607
10      AVG  NAC      GBP 2015-07-01  0.7102
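For reference, the raw daily codes shown above can be parsed with a format string that treats the M and D as literal characters. This is only a sketch of one way to do the conversion, not the fix adopted in the package:

```r
# Raw daily time codes in the form shown in the output above
raw_time <- c("2015M07D03", "2015M12D31")

# "M" and "D" are matched literally; %Y, %m and %d pick up the numbers
daily_dates <- as.Date(raw_time, format = "%YM%mD%d")
format(daily_dates)
```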

Session info

devtools::session_info()
Session info ---------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.0 (2015-04-16)
 system   x86_64, mingw32             
 ui       RStudio (0.99.435)          
 language (EN)                        
 collate  Slovenian_Slovenia.1250     
 tz       Europe/Prague               

Packages -------------------------------------------------------------------------------------------------------------------------
 package    * version date       source        
 assertthat   0.1     2013-12-06 CRAN (R 3.1.3)
 curl         0.9     2015-06-19 CRAN (R 3.2.1)
 DBI          0.3.1   2014-09-24 CRAN (R 3.1.3)
 devtools     1.8.0   2015-05-09 CRAN (R 3.2.0)
 digest       0.6.8   2014-12-31 CRAN (R 3.1.3)
 dplyr      * 0.4.2   2015-06-16 CRAN (R 3.2.1)
 eurostat   * 1.0.16  2015-03-27 CRAN (R 3.2.1)
 git2r        0.10.1  2015-05-07 CRAN (R 3.2.0)
 lazyeval     0.1.10  2015-01-02 CRAN (R 3.1.3)
 magrittr     1.5     2014-11-22 CRAN (R 3.1.3)
 memoise      0.2.1   2014-04-22 CRAN (R 3.1.3)
 plyr       * 1.8.3   2015-06-12 CRAN (R 3.2.1)
 R6           2.0.1   2014-10-29 CRAN (R 3.1.3)
 Rcpp         0.11.6  2015-05-01 CRAN (R 3.2.0)
 reshape2     1.4.1   2014-12-06 CRAN (R 3.1.3)
 rversions    1.0.1   2015-06-06 CRAN (R 3.2.0)
 stringi      0.5-5   2015-06-29 CRAN (R 3.2.1)
 stringr    * 1.0.0   2015-04-30 CRAN (R 3.2.0)
 tidyr        0.2.0   2014-12-05 CRAN (R 3.1.3)
 xml2         0.1.1   2015-06-02 CRAN (R 3.2.0)

Add option stringsAsFactors = FALSE to get_eurostat()

Conflicting value column name

I accidentally opened this issue on pxweb side (rOpenGov/pxweb#84).

To recap it briefly here:
At least the dataset sbs_pen_7b1 includes a variable named value. Since the column holding the values is also called value, there would be two columns named value.

Solutions:

Rename the values column to values, as in pxweb. However, this will break users' existing code.
Apply a dirty fix, such as using VALUE (upper case) for the variable name.

If there are no objections, I will change it to values.

An option for label_eurostat() to retain specified code columns.

For a data.frame, use a code argument.
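A rough sketch of how such an option could behave is given below. The function label_keep_codes(), the structure of dics (named vectors mapping codes to labels) and the "_code" suffix are all assumptions made for this example:

```r
# Hypothetical sketch: replace codes with labels, optionally keeping the
# original codes in an extra "<column>_code" column.
label_keep_codes <- function(dat, dics, code = NULL) {
  out <- dat
  for (col in intersect(names(dat), names(dics))) {
    codes <- as.character(dat[[col]])
    if (col %in% code) {
      out[[paste0(col, "_code")]] <- codes
    }
    out[[col]] <- unname(dics[[col]][codes])
  }
  out
}

dat <- data.frame(geo = c("FI", "SE"), values = c(1, 2),
                  stringsAsFactors = FALSE)
dics <- list(geo = c(FI = "Finland", SE = "Sweden"))
labelled <- label_keep_codes(dat, dics, code = "geo")
labelled
```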

Fix cran error

In the vignette. Due, I think, to a change at Eurostat.

Use read_tsv from readr instead of read.table

I tested the new readr package for reading the tsv file. It is somewhat faster, but the difference is not big. If implemented, tidy_eurostat() should also be changed.

tested:

id <- "avia_goincc"
base <- eurostat_url()         
url <- paste(base, 
             "estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2F",
             id, ".tsv.gz", sep="")

tfile <- tempfile(fileext = ".tsv.gz")
# download and read file
download.file(url, tfile)

# Present:
system.time(dat1 <- read.table(gzfile(tfile), sep="\t", na.strings = ": ", 
                                header = TRUE, stringsAsFactors = FALSE))
#   user  system elapsed 
#  1.23    0.00    1.25 

# with readr:
system.time({
  n <- ncol(readr::read_tsv(tfile, n_max = 1))
  dat2 <- readr::read_tsv(tfile, na = ": ", 
                          col_types = c(list(readr::col_character()),
                                        rep(list(readr::col_numeric()), 
                                            times = n-1)))
  })
#    user  system elapsed 
#   0.94    0.00    0.93 
unlink(tfile)

Arguments "filters" and "type" do not appear when using get_eurostat()

Hi,

I am trying to use the function get_eurostat(); however, when I try to pass the arguments "type" and/or "filters", I get errors.

For example:
Error in get_eurostat(id, filters = list(geo = c("EU28", "FI"), lastTimePeriod = 1), :
unused arguments (filters = list(geo = c("EU28", "FI"), lastTimePeriod = 1), type = "label")

AND

Error in get_eurostat(id, filters = list(geo = c("EU28", "FI"), lastTimePeriod = 1), :
unused arguments (filters = list(geo = c("EU28", "FI"), lastTimePeriod = 1), type = "label")

Any ideas what's going on?
Apologies if this is a rookie mistake. I am still a novice at R.

Thanks!

label_eurostat

Would it be useful to keep both the label and the original id in the output table from label_eurostat?

reurostat

Check connections to https://github.com/Tungurahua/reurostat

Vignette TOC

I added a TOC with links to the R Markdown vignette. I think there is a way to generate it automatically. Investigate and add if possible, as that would be easier to maintain. Not urgent.

getEurostatDictionary fails to read "table_dic" properly

I get a flawed data.frame when reading the table_dic dictionary with getEurostatDictionary. Some rows of the V2 column end up holding several rows of the dictionary.

To reproduce the problem:
dict <- getEurostatDictionary("table_dic")
dict[40,]

I have fixed the issue in:
jhuovari@d3d028c
with the quote = "\"" argument.
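The likely mechanism can be shown with made-up rows: by default read.table treats both the double quote and the apostrophe as quoting characters, so an apostrophe inside a label can open a quoted string that swallows the following rows. The text below is invented and only illustrates the effect:

```r
# Invented dictionary rows; two labels contain an apostrophe
dic_text <- "a\tFarmer's income\nb\tTotal\nc\tWorker's pay\n"

# Default quote set includes the apostrophe, so rows get merged
default_read <- read.table(text = dic_text, sep = "\t", header = FALSE,
                           stringsAsFactors = FALSE)

# Restricting quoting to the double quote keeps one row per line
fixed_read <- read.table(text = dic_text, sep = "\t", header = FALSE,
                         quote = "\"", stringsAsFactors = FALSE)

nrow(default_read)
nrow(fixed_read)
```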

Eurostat open data portal?

I think referring to Eurostat open data and the open data portal is a bit misleading. Eurostat does have an Open Data Portal (https://open-data.europa.eu/en/), and it has an API (https://open-data.europa.eu/en/developerscorner). The package, however, gets its data from the Eurostat main database, and the reference in the README is to the Eurostat main page. The Open Data Portal gets its data from the same source, but to refer to the Open Data Portal we should use its API. I think CRAN was also complaining about this in their comment.

The Description also has "together with analysis and visualization utilities." I don't think we really have any of those, at least yet.

cran warning

Dear maintainer,

This currently gives

checking re-building of vignette outputs ... WARNING
Error in re-building vignettes:
...
Loading required package: xml2
trying URL ‘http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Ftsdtr210.tsv.gz’
Content type 'application/octet-stream;charset=UTF-8' length 4001 bytes
downloaded 4001 bytes

Table tsdtr210 cached at /tmp/Rtmp47Gfju/eurostat/tsdtr210_num_code_TF.rds
Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
:
invalid input found on input connection ‘http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=dic%2Fen%2Fgeo.dic’
Quitting from lines 160-162 (eurostat_tutorial.Rmd)
Error: processing vignette ‘eurostat_tutorial.Rmd’ failed with diagnostics:
factor level [4] is duplicated
Execution halted

Can you pls fix as necessary?

Switch to data_frames

The data_frame format of the tibble package is a useful update to data.frame: https://blog.rstudio.org/2016/03/24/tibble-1-0-0/ It could be useful to switch the output format from data.frame to data_frame.

Issue with negative values (being converted to absolute values)

Hello,
Thank you so much for the very useful package! It's been a great help.

However, I encountered a problem when downloading eurostat series with negative values. In R these don't get recorded as negative but as positive (i.e. the absolute value).

This happened to me when downloading tables [nama_nace31_k] & [nama_nace21_k]. If you choose variable indic_na="B1G" & unit="PCH_PRE", i.e percentage change in gross value added, all the values are larger than 0. This is even though the volume series in levels exhibit also decreases from time to time.

What am I doing wrong or is there some kind of bug?

Many thanks!

memory use

When downloading a 4.2 MB data file, the package takes a long time and consumes all of the available (16 GB) memory. Tested with R 3.2.5 and 3.3.0 within RStudio.
Code:
library(eurostat)
search_eurostat("asylum")
df1 <- get_eurostat("migr_asyappctzm")

Factor levels in Eurostat order

I changed the ordering of factor levels in get_eurostat(). By default, factor() sorts the levels, but I think it is better to use the original Eurostat ordering. Eurostat usually orders labels alphabetically as well, but when they are not in that order, there is a reason for it. It is better to follow Eurostat's reasoning.

Implemented in: 66d4533
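The difference between the two orderings can be seen in a short base R example. The codes are illustrative, and this is not necessarily how the commit above implements it:

```r
geo <- c("EU28", "BE", "DK", "BE", "EU28")

# Default: levels are sorted
sorted_levels <- levels(factor(geo))

# Order of first appearance in the data is preserved
original_levels <- levels(factor(geo, levels = unique(geo)))

sorted_levels
original_levels
```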

get_eurostat() fails with single variable dataset

Issue reported by Marie Trotta at louhos googlegroup:

I have a question concerning the package “Eurostat” in R
I am trying to get the table tps00003 but I got an error message
The package doesn’t know how to deal with the raw_id and cannot properly differentiate between entities.
Do you have any idea about how I could solve this issue?

Thank you

k <- get_eurostat("tps00003", time_format="raw", cache = FALSE)
trying URL 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Ftps00003.tsv.gz'
Content type 'application/octet-stream;charset=UTF-8' length 1187 bytes
downloaded 1187 bytes

Warning message:
In `[<-.data.frame`(`*tmp*`, , cnames1, value = list(1L, 1L, 1L,  :
  provided 37 variables to replace 1 variables

This is due to a missing drop = FALSE in tidy_eurostat(): with a single id column, the subset collapses from a data.frame to a vector.
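The underlying base R behaviour is easy to reproduce. The data below are invented and only show why a single-column selection needs drop = FALSE:

```r
dat <- data.frame(geo = c("AT", "BE"), X2014 = c(1, 2),
                  stringsAsFactors = FALSE)
id_cols <- "geo"  # only one id column, as in a single-variable dataset

# Without drop = FALSE a single column collapses to a plain vector
collapsed <- dat[, id_cols]

# With drop = FALSE the result stays a one-column data.frame
kept <- dat[, id_cols, drop = FALSE]

class(collapsed)
class(kept)
```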

UTF8 encoding problems with statfi data in Windows

get_eurostat() is not working properly on Windows with non UTF-8 encoding,
there is a problem with conversion from px class to data.frame in the statfi package.

stringsAsFactors = FALSE does not affect time

tidy_eurostat(x, time_format = "raw", stringsAsFactors = FALSE) returns the time variable as a factor. This comes from row 77, where gather_ converts to factors. It seems, however, that this behaviour will change in the next release of tidyr (see: tidyverse/tidyr#96).

So, I will leave it as it is now and let the tidyr update take care of this. But perhaps we should then add a conversion of the raw time to a factor after that step?

Eurostat JSON

Eurostat also has an API for getting data in JSON format. That is handy if you want just a part of a dataset, as you can filter the data before downloading.

This should not replace the existing method, as there is a size limit for API calls, but complement it.

Below is a quick demonstration of getting JSON data:

get_eurostat_json <- function(dataset, filters = NULL){
  url_list <- list(scheme = "http",
                        hostname = "ec.europa.eu/eurostat/wdds/rest/data/v1.1/json/en",
                        path = dataset,
                        query = filters)
  class(url_list) <- "url"
  url <- httr::build_url(url_list)
  jdat <- jsonlite::fromJSON(url)
  dims <- jdat[[1]]$dimension
  ids <- dims$id

  dims_list <- lapply(dims[rev(ids)], function(x){
    unlist(x$category$label)
  })

  variables <- expand.grid(dims_list, KEEP.OUT.ATTRS = FALSE)

  dat <- data.frame(variables[rev(names(variables))], values = jdat[[1]]$value)
  dat
}

y <- get_eurostat_json("cdh_e_fos")
yy <- get_eurostat_json("nama_gdp_c", filters = list(geo="EU28",
                                                     unit="EUR_HAB",
                                                     indic_na="B1GM"))

Unit tests

Would be good to add some unit tests.

ggplot2 dev

I see the following problems:

checking files in ‘vignettes’ ... NOTE
The following directory looks like a leftover from 'knitr':
  ‘figure’
Please remove from your package.

JSON error

prod <- get_eurostat("sts_inpr_a", filters = list(geo = "AT"))
Error in get_eurostat_json(id, filters, type = type, stringsAsFactors = stringsAsFactors, :
Failure to get data. Status code: 416

"sts_inpr_a" stands for Short-TermBusinessStatistics_IndustryProduction_annual
Without the filters argument, it works perfectly fine

get_eurostat

Hello,

When I use get_eurostat, I get the following memory problem :

dat4 <- get_eurostat("bop_c6_q", select_time="Q")
trying URL 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fbop_c6_q.tsv.gz'
Content type 'application/octet-stream;charset=UTF-8' length 36153672 bytes (34.5 MB)
downloaded 34.5 MB
Error: cannot allocate vector of size 396.9 Mb

So I have tried using the filter but it doesn't work either :

dat5 <- get_eurostat("bop_c6_q", select_time="Q", time_format = "num", filters = list(currency="MIO_EUR", bop_item="FA", sector10="S1", sectpart="S1", stk_flow="NET", partner="WRL_REST", geo="FR"))
Error in get_eurostat_json(id, filters, type = type, stringsAsFactors = stringsAsFactors, :
Failure to get data. Probably filters did not return any data
or data exceeded query size limitation. Status code: 500

Can you help me ?

Thank you very much

Base url

Define base_url function that can be called from all places where needed and changed at once if needed.
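A sketch of what this could look like, using the bulk download path quoted in the issues above. The function names are hypothetical, and the URL is the historical one from these reports, which may no longer be valid:

```r
# Hypothetical single source of truth for the site address
eurostat_base_url <- function() {
  "http://ec.europa.eu/eurostat/"
}

# Every URL is then derived from the base, so a site change needs one edit
bulk_data_url <- function(id) {
  paste0(eurostat_base_url(),
         "estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2F",
         id, ".tsv.gz")
}

bulk_data_url("cens_01rdhh")
```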

encoding problem with get_eurostat_dic

There is an issue in get_eurostat_dic on systems with an encoding different from "Windows-1252", at least on the Linux systems where I tried it. I was able to resolve the issue by overriding get_eurostat_dic and configuring the proper encoding (UTF-8).

get_eurostat

When using get_eurostat I receive an error. The search for the id as well as the download works. But assigning the data to the data.frame gives an error.

id <- search_eurostat("Modal split of passenger transport", 
                  type = "table")$code[1]
print(id)

[1] "tsdtr210"

dat <- get_eurostat(id, time_format = "num")

Error in $<-.data.frame(*tmp*, "values", value = numeric(0)) :
Replacement has 0 rows, data has 2145

Any help is appreciated!

get_eurostat() drop units with all observations NA

get_eurostat() seems to drop those variable combinations where all time observations are NA.
Noticed that with get_eurostat_json():

> dim(get_eurostat_json("cdh_e_fos"))
[1] 672   6
> dim(get_eurostat("cdh_e_fos"))
[1] 400   6

Should we fix it, or leave it as it is? In practice it probably does not matter that much, and the factors have all levels anyway. However, if we use get_eurostat() as a general wrapper, it should return the same data.frame both ways.

.env

For clarity move .SmarterPoland env into eurostat env.

get_eurostat: separate_{tidyr} with convert = TRUE converts T to TRUE instead of TOTAL

In Eurostat data the value T stands for Total,
but separate_(convert = TRUE) in get_eurostat() causes T to be translated to TRUE, which might be confusing.

Suggestion: do not allow automatic conversion, and set
convert = FALSE

an example:
head(t1 <- get_eurostat("tsdtr420"))
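The conversion can be reproduced with base R's type.convert(), which tidyr documents as the function applied when convert = TRUE. The codes below are illustrative:

```r
# A column that contains only "T" (and "F") looks logical to type.convert()
only_t <- type.convert(c("T", "T"), as.is = TRUE)

# As soon as another code such as "M" is present, it stays character
mixed <- type.convert(c("T", "M"), as.is = TRUE)

only_t
mixed
```

This also explains why the problem only shows up in some tables: it depends on which codes happen to be present in the column.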

Problem with dev tidyr

checking files in ‘vignettes’ ... NOTE
The following directory looks like a leftover from 'knitr':
  ‘figure’
Please remove from your package.

checking re-building of vignette outputs ... NOTE
Error in re-building vignettes:
  ...
trying URL 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Ftsdtr210.tsv.gz'
Content type 'application/octet-stream;charset=UTF-8' length 4001 bytes
==================================================
downloaded 4001 bytes

Quitting from lines 92-93 (eurostat_tutorial.Rmd) 
Error: processing vignette 'eurostat_tutorial.Rmd' failed with diagnostics:
argument is of length zero
Execution halted

DONE
Status: 2 NOTEs

Could you please let me know if this is a bug in the dev version of tidyr?

Label dictionaries cache

This could be implemented in the same way as the TOC cache, which is handled with set_eurostat_toc().

Datasets with mixed time format get confused dates

Some datasets, like "avia_goincc", include a mix of time formats: the same dataset contains annual, quarterly and monthly data. With time_format "date" or "num" these get mixed together in the output.

We should implement a selector for annual (A), quarterly (Q) and monthly (M) data, and also a warning to use time_format = "raw" if no frequency is selected.
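A selector of this kind could be sketched as follows. The raw code layouts ("2015", "2015Q1", "2015M07", "2015M07D03") and the function names are assumptions for the example; the annual code is written "Y" here, as in the error messages quoted in other issues:

```r
# Hypothetical sketch: classify raw time codes by frequency
time_frequency <- function(time_raw) {
  time_raw <- as.character(time_raw)
  freq <- rep("Y", length(time_raw))                  # plain year
  freq[grepl("Q", time_raw, fixed = TRUE)] <- "Q"     # quarter
  freq[grepl("M", time_raw, fixed = TRUE)] <- "M"     # month
  freq[grepl("D", time_raw, fixed = TRUE)] <- "D"     # day (contains M and D)
  freq
}

# Keep only the rows of one frequency
select_frequency <- function(dat, select_time) {
  dat[time_frequency(dat$time) == select_time, , drop = FALSE]
}

dat <- data.frame(time = c("2015", "2015Q1", "2015M07"), values = 1:3,
                  stringsAsFactors = FALSE)
select_frequency(dat, "Q")
```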

Startup message

Is it really needed? Do users need to see that information every time they attach the package?

The format of the time column in the get_eurostat output

At the moment the time column of the data.frame from get_eurostat is a character string. That is not very convenient. I would rather have it in numeric or date format. For yearly data a numeric value would be good, but for quarterly and monthly data dates would be best. So, to always have the same format, I think it should be dates.

But should it be the first day or the last day of the period? I think the first day would be the clearest choice, but at least the Quandl package uses the last day.

I think the same problem also applies to the pxweb package, and it would be good to have the same solution for both.

What do you think?
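To make the two options concrete, here is a sketch that computes both the first and the last day of a period. The function name period_bounds() and the raw code layouts ("2015", "2015Q2", "2015M07") are assumptions for this example:

```r
# Hypothetical sketch: first and last day of annual, quarterly, monthly periods
period_bounds <- function(time_raw) {
  time_raw <- as.character(time_raw)
  year <- as.integer(substr(time_raw, 1, 4))
  rest <- substring(time_raw, 5)            # "", "Q2" or "M07"
  num <- as.integer(substring(rest, 2))     # NA for annual codes
  is_q <- startsWith(rest, "Q")
  first_month <- ifelse(rest == "", 1L, ifelse(is_q, (num - 1L) * 3L + 1L, num))
  n_months <- ifelse(rest == "", 12L, ifelse(is_q, 3L, 1L))
  first <- as.Date(sprintf("%d-%02d-01", year, first_month))
  # Last day = day before the first day of the following period
  total <- year * 12L + (first_month - 1L) + n_months
  next_start <- as.Date(sprintf("%d-%02d-01", total %/% 12L, total %% 12L + 1L))
  data.frame(time = time_raw, first = first, last = next_start - 1,
             stringsAsFactors = FALSE)
}

bounds <- period_bounds(c("2015", "2015Q2", "2015M12"))
bounds
```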

Problem with label_eurostat

Hi everyone,

My colleague told me about this package for extracting data from eurostat and I really love it. Strangely enough though, there seems to be something wrong when I try to label a dataset. Even though I'm using the exact same code that my colleague shared with me (and for him it does work):

toc <- get_eurostat_toc()

data_mar <- search_eurostat("Maritime", type = "dataset")

id <- "mar_go_qm_nl"
dat <- get_eurostat(id, time_format = "date")

datl <- label_eurostat(dat)

I was using R 3.2.0 and updated to 3.2.2 but it didn't matter. Does anyone know what the problem is?

Best regards,
Niek

Rename grepEurostatTOC to search_eurostat_data?

Would it be clearer? grep probably does not tell everyone what the function does.

Exceptions to ISO 3166-1 alpha-2

There is an issue in your demo file, and it also has to be considered in the context of the packages "countrycode" and "eurostat". The European Commission and Eurostat generally use ISO 3166-1 alpha-2 codes, with two exceptions: EL (not GR) is used to represent Greece, and UK (not GB) is used to represent the United Kingdom.

These exceptions affect practically all Eurostat tables. I think this special case should be addressed either in eurostat or in countrycode. However, it is not a universal issue, so it should be handled in packages concerned with European Commission / Eurostat data access.

While the issue is very simple to handle in program code, it is not very easy to detect. It took me quite some time to find out about this, as I kept losing some data values.
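A minimal sketch of handling the two exceptions in code, with a hypothetical helper name; it recodes only EL and UK and leaves every other code untouched:

```r
# Hypothetical helper: map Eurostat country codes to ISO 3166-1 alpha-2
eurostat_to_iso2 <- function(geo) {
  exceptions <- c(EL = "GR", UK = "GB")
  geo <- as.character(geo)
  hit <- geo %in% names(exceptions)
  geo[hit] <- unname(exceptions[geo[hit]])
  geo
}

eurostat_to_iso2(c("EL", "UK", "FI"))
```

Applying a mapping like this before joining Eurostat data with other country-level sources avoids silently dropping Greece and the United Kingdom in the merge.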