censobr's Introduction

censobr: Download Data from Brazil's Population Census

censobr is an R package to download data from Brazil's Population Census. The package is built on top of the Arrow platform, which allows users to work with larger-than-memory census data using familiar {dplyr} functions.

Installation

# install from CRAN
install.packages("censobr")

# or use the development version with latest features
utils::remove.packages('censobr')
remotes::install_github("ipeaGIT/censobr", ref="dev")
library(censobr)

Basic usage

The package currently includes 6 main functions to download & read census data:

  1. read_population()
  2. read_households()
  3. read_mortality()
  4. read_families()
  5. read_emigration()
  6. read_tracts()

censobr also includes a few support functions to help users navigate the documentation of Brazilian censuses, providing convenient information on data variables and methodology:

  1. data_dictionary()
  2. questionnaire()
  3. interview_manual()

Finally, the package includes two functions to help users manage the data cached locally:

  1. censobr_cache()
  2. set_censobr_cache_dir()

All censobr functions that read data follow the same syntax, so any data set can be downloaded with a single line of code. For example:

read_households(
  year,          # year of reference
  columns,       # select columns to read
  add_labels,    # add labels to categorical variables
  as_data_frame, # return an Arrow DataSet or a data.frame
  showProgress,  # show download progress bar
  cache          # cache data for faster access later
  )

Note: all data sets in censobr are enriched with geography columns following the naming standards of the {geobr} package, to help data manipulation and integration with spatial data from {geobr}. The added columns are: c('code_muni', 'code_state', 'abbrev_state', 'name_state', 'code_region', 'name_region', 'code_weighting').
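
As an illustration of the syntax above, here is a minimal sketch with concrete argument values. The helper name read_households_subset() is hypothetical (it is not part of censobr); it only wraps read_households() so that nothing is downloaded until the helper is called. The argument values mirror the ones used in the issue reports further down this page, and the default column selection assumes that the geography columns listed in the note can be requested through the columns argument.

```r
# Hypothetical helper, not part of censobr: wraps read_households() with
# concrete argument values. Nothing is downloaded until it is called.
read_households_subset <- function(year = 2010,
                                   columns = c("code_muni", "abbrev_state")) {
  censobr::read_households(
    year          = year,     # census year of reference
    columns       = columns,  # read only these columns (NULL reads all)
    add_labels    = NULL,     # keep the original codes of categorical variables
    as_data_frame = FALSE,    # return an Arrow object instead of a data.frame
    showProgress  = FALSE,    # hide the download progress bar
    cache         = TRUE      # reuse the locally cached file when available
  )
}
```

Calling hh <- read_households_subset() would then return the 2010 household records restricted to the two geography columns.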

Data cache

The first time the user runs a function, censobr downloads the file and stores it locally, so each data set only needs to be downloaded once. When the cache parameter is set to TRUE (the default), later calls read the cached file, which is much faster.

  • censobr_cache(): can be used to list and/or delete data files cached locally
  • set_censobr_cache_dir(): can be used to set custom cache directory for censobr files
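
The two cache functions can be combined roughly as follows. This is a sketch under assumptions: the helper names use_project_cache() and clean_cache() are hypothetical, and the argument names path, list_files and delete_file are assumed from the package documentation, so check ?censobr_cache and ?set_censobr_cache_dir before relying on them.

```r
# Hypothetical helper: point censobr at a custom cache folder.
# Assumes set_censobr_cache_dir() takes a `path` argument.
use_project_cache <- function(path = file.path(tempdir(), "censobr-cache")) {
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  censobr::set_censobr_cache_dir(path = path)
  invisible(path)
}

# Hypothetical helper: list the cached files and optionally delete one.
# `file` should be a name reported by the listing; it is assumed that
# censobr_cache() accepts `list_files` and `delete_file` arguments.
clean_cache <- function(file = NULL) {
  censobr::censobr_cache(list_files = TRUE)
  if (!is.null(file)) {
    censobr::censobr_cache(delete_file = file)
  }
  invisible(NULL)
}
```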

Larger-than-memory Data

Microdata of Brazilian censuses are often too big to load into the user's RAM. To avoid this problem, censobr by default returns an Arrow table, which can be analyzed like a regular data.frame using the dplyr package without loading the full data into memory.
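
A minimal sketch of this workflow is shown below. The variable codes and the summary are taken from the vignette code visible in the GitHub Actions traceback further down this page; the helper name summarise_state() is hypothetical. The filter runs on the Arrow object, and only the rows of the selected state are brought into memory by collect().

```r
# Hypothetical helper, not part of censobr. Nothing is downloaded until called.
summarise_state <- function(state = "RJ") {
  # By default this returns an Arrow object, so no rows are loaded into RAM yet
  pop <- censobr::read_population(
    year = 2010,
    columns = c("abbrev_state", "V0606", "V0010", "V6400"),
    showProgress = FALSE
  )

  pop |>
    dplyr::filter(abbrev_state == !!state) |>  # evaluated lazily by Arrow
    dplyr::collect() |>                        # only this state's rows reach memory
    dplyr::group_by(V0606) |>
    dplyr::summarise(
      higher_edu = sum(V0010[which(V6400 == 4)]) / sum(V0010),
      pop = sum(V0010)
    )
}
```

The summary step runs after collect() because it uses base R subsetting, which is ordinary in-memory dplyr code rather than an Arrow expression. Use data_dictionary() to confirm what each variable code means before interpreting the result.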

More info in the package vignette.

Contributing to censobr

If you would like to contribute to censobr, you're welcome to open an issue to explain the proposed contribution.


Related projects

As far as we know, censobr is the only R package that provides fast and convenient access to data from Brazilian censuses. The microdadosBrasil package used to provide access to microdata of several public data sets, but unfortunately it has been discontinued.

Similar packages for other countries

Credits

Original Census data is collected by the Brazilian Institute of Geography and Statistics (IBGE). The censobr package is developed by a team at the Institute for Applied Economic Research (Ipea), Brazil. If you want to cite this package, you can cite it as:

bibentry(
  bibtype  = "Manual",
  title       = "censobr: Download Data from Brazil's Population Census",
  author      = "Rafael H. M. Pereira [aut, cre] and Rogério J. Barbosa [aut]",
  year        = 2023,
  version     = "v0.2.0",
  url         = "https://CRAN.R-project.org/package=censobr",
  textVersion = "Pereira, R. H. M.; Barbosa, R. J. (2023) censobr: Download Data from Brazil's Population Census. R package version v0.2.0, <https://CRAN.R-project.org/package=censobr>."
)

censobr's People

Contributors

diraol, nealrichardson, rafapereirabr

censobr's Issues

Variables present in the data dictionary

Hello,

I am using the censobr package to load the population data for the year 2000. I noticed that the variable "Sex of the last child born alive" is present in the data dictionary (V0464), but it does not appear in the data actually loaded. Below is the code I used:

library(censobr)

# Load population data for the year 2000
df3 <- read_population(year = 2000,
                       showProgress = FALSE)

# Check whether the variable is present in the data
colnames(df3)

Can you help me, please?

Missing municipality id in 1991 population table

Hi @rafapereirabr !

I have used the population and household tables from 2010 to 1980. In 1991 I found a problem in the identification code column of the municipality of the observation. I don't know if I'm doing something wrong, but of the 17,045,653 observations, 8,575,800 are missing from this column. I was surprised when I couldn't filter the information for the municipality of Rio de Janeiro. Here's the code I used to test it:

> census1991_2 <- read_population(year = 1991, cache = T)
Reading data cached locally.
> census1991_2 |> 
+   select(code_muni) |> 
+   mutate(test = is.na(code_muni)) |> 
+   count(test) |>
+   collect()
# A tibble: 2 × 2
  test        n
  <lgl>   <int>
1 FALSE 8469853
2 TRUE  8575800

Add 2022 Census data

  • microdata - population
  • microdata - households
  • census-tract level data
  • data dictionary
  • questionnaire
  • interview_manual

Persistent error in Github Actions macOS-latest (oldrel)

The package currently passes every check when tested locally. It also passes the tests in GitHub Actions on every OS, except for macOS-latest (oldrel). Here's the output of GHA, rather difficult to interpret, tbh.

The error occurs when building the vignette. I've tried removing the vignette entirely, and all checks passed.

Run options(crayon.enabled = TRUE)
── R CMD build ─────────────────────────────────────────────────────────────────
pdflatex not found! Not building PDF manual.
* checking for file ‘.../DESCRIPTION’ ... OK
* preparing ‘censobr’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ...sh: line 1:  5061 Illegal instruction: 4  '/Library/Frameworks/R.framework/Resources/bin/Rscript' --vanilla --default-packages= -e "tools::buildVignettes(dir = '.', tangle = TRUE)" > '/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//RtmpnQbntJ/xshell1395ed07180' 2>&1
 ERROR
--- re-building ‘censobr.Rmd’ using rmarkdown
 *** caught illegal operation ***
address 0x1120d8a63, cause 'illegal opcode'
Traceback:
 1: Table__from_ExecPlanReader(self)
 2: x$read_table()
 3: as_arrow_table.RecordBatchReader(reader)
 4: as_arrow_table(reader)
 5: as_arrow_table.arrow_dplyr_query(x)
 6: as_arrow_table(x)
 7: doTryCatch(return(expr), name, parentenv, handler)
 8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 9: tryCatchList(expr, classes, parentenv, handlers)
10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) {    augment_io_error_msg(e, call, schema = schema())})
11: compute.arrow_dplyr_query(x)
12: collect.arrow_dplyr_query(filter(pop, abbrev_state == "RJ"))
13: collect(filter(pop, abbrev_state == "RJ"))
14: group_by(collect(filter(pop, abbrev_state == "RJ")), V0606)
15: summarize(group_by(collect(filter(pop, abbrev_state == "RJ")),     V0606), higher_edu = sum(V0010[which(V6400 == 4)])/sum(V0010),     pop = sum(V0010))
16: collect(summarize(group_by(collect(filter(pop, abbrev_state ==     "RJ")), V0606), higher_edu = sum(V0010[which(V6400 == 4)])/sum(V0010),     pop = sum(V0010)))
17: eval(expr, envir, enclos)
18: eval(expr, envir, enclos)
19: eval_with_user_handlers(expr, envir, enclos, user_handlers)
20: withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers))
21: withCallingHandlers(withVisible(eval_with_user_handlers(expr,     envir, enclos, user_handlers)), warning = wHandler, error = eHandler,     message = mHandler)
22: handle(ev <- withCallingHandlers(withVisible(eval_with_user_handlers(expr,     envir, enclos, user_handlers)), warning = wHandler, error = eHandler,     message = mHandler))
23: timing_fn(handle(ev <- withCallingHandlers(withVisible(eval_with_user_handlers(expr,     envir, enclos, user_handlers)), warning = wHandler, error = eHandler,     message = mHandler)))
24: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos,     debug = debug, last = i == length(out), use_try = stop_on_error !=         2L, keep_warning = keep_warning, keep_message = keep_message,     log_echo = log_echo, log_warning = log_warning, output_handler = output_handler,     include_timing = include_timing)
25: evaluate::evaluate(...)
26: evaluate(code, envir = env, new_device = FALSE, keep_warning = if (is.numeric(options$warning)) TRUE else options$warning,     keep_message = if (is.numeric(options$message)) TRUE else options$message,     stop_on_error = if (is.numeric(options$error)) options$error else {        if (options$error && options$include)             0L        else 2L    }, output_handler = knit_handlers(options$render, options))
27: in_dir(input_dir(), expr)
28: in_input_dir(evaluate(code, envir = env, new_device = FALSE,     keep_warning = if (is.numeric(options$warning)) TRUE else options$warning,     keep_message = if (is.numeric(options$message)) TRUE else options$message,     stop_on_error = if (is.numeric(options$error)) options$error else {        if (options$error && options$include)             0L        else 2L    }, output_handler = knit_handlers(options$render, options)))
29: eng_r(options)
30: block_exec(params)
31: call_block(x)
32: process_group.block(group)
33: process_group(group)
34: withCallingHandlers(if (tangle) process_tangle(group) else process_group(group),     error = function(e) if (xfun::pkg_available("rlang", "1.0.0") &&         !xfun::check_old_package("learnr", "0.11.3")) rlang::entrace(e))
35: withCallingHandlers(withCallingHandlers(if (tangle) process_tangle(group) else process_group(group),     error = function(e) if (xfun::pkg_available("rlang", "1.0.0") &&         !xfun::check_old_package("learnr", "0.11.3")) rlang::entrace(e)),     error = function(e) {        setwd(wd)        write_utf8(res, output %n% stdout())        message("\nQuitting from lines ", paste(current_lines(i),             collapse = "-"), if (labels[i] != "")             sprintf(" [%s]", labels[i]), sprintf(" (%s)", knit_concord$get("infile")))    })
36: process_file(text, output)
37: knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet)
38: rmarkdown::render(file, encoding = encoding, quiet = quiet, envir = globalenv(),     output_dir = getwd(), ...)
39: vweave_rmarkdown(...)
40: engine$weave(file, quiet = quiet, encoding = enc)
41: doTryCatch(return(expr), name, parentenv, handler)
42: tryCatchOne(expr, names, parentenv, handlers[[1L]])
43: tryCatchList(expr, classes, parentenv, handlers)
44: tryCatch({    engine$weave(file, quiet = quiet, encoding = enc)    setwd(startdir)    output <- find_vignette_product(name, by = "weave", engine = engine)    if (!have.makefile && vignette_is_tex(output)) {        texi2pdf(file = output, clean = FALSE, quiet = quiet)        output <- find_vignette_product(name, by = "texi2pdf",             engine = engine)    }    outputs <- c(outputs, output)}, error = function(e) {    thisOK <<- FALSE    fails <<- c(fails, file)    message(gettextf("Error: processing vignette '%s' failed with diagnostics:\n%s",         file, conditionMessage(e)))})
45: tools::buildVignettes(dir = ".", tangle = TRUE)
An irrecoverable exception occurred. R is aborting now ...
Error: Error in proc$get_built_file() : Build process failed
Calls: <Anonymous> ... build_package -> with_envvar -> force -> <Anonymous>
Execution halted
Error: Process completed with exit code 1.

New function to add labels to variables

Initially, here's the idea. One function per data set

  • add_labels_households(arrw, lang = c('PT', 'EN'))
  • add_labels_population(arrw, lang = c('PT', 'EN'))
  • add_labels_mortality(arrw, lang = c('PT', 'EN'))

Depending on how it goes, it might be better to have a single function that applies to different data sets. E.g.

  • add_labels(arrw, dataset = c('households', 'population', 'mortality'), lang = c('PT', 'EN'))

The downside of the first approach is having too many functions, with code repeated for the variables that are common to several datasets. The downside of the second approach is that the single function would be too big and harder to manage.
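
To make the second approach concrete, here is a toy sketch of a single add_labels() function in which labels shared by several datasets are defined only once. Everything in it is an assumption made for illustration: the sex variable, its codes, the registry objects, and the use of a plain data.frame instead of the Arrow object (arrw) that the proposal refers to.

```r
# Labels shared across datasets are defined once (illustrative values only).
shared_labels <- list(
  sex = list(
    PT = c("1" = "Masculino", "2" = "Feminino"),
    EN = c("1" = "Male", "2" = "Female")
  )
)

# Hypothetical registry: which labelled variables each dataset carries.
dataset_vars <- list(
  households = character(0),
  population = c("sex"),
  mortality  = c("sex")
)

# Single entry point that dispatches on `dataset` and `lang`.
add_labels <- function(df,
                       dataset = c("households", "population", "mortality"),
                       lang = c("PT", "EN")) {
  dataset <- match.arg(dataset)
  lang <- match.arg(lang)
  # Recode only the variables registered for this dataset and present in df
  for (v in intersect(dataset_vars[[dataset]], names(df))) {
    lookup <- shared_labels[[v]][[lang]]
    df[[v]] <- unname(lookup[as.character(df[[v]])])
  }
  df
}

toy <- data.frame(sex = c(1, 2, 2))
labelled <- add_labels(toy, dataset = "population", lang = "EN")
```

Keeping the label tables outside the function is one way to limit how large the function itself grows, at the cost of maintaining the registry.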

Add microdata documentation

  • 1970 - Available in the dev version. Planned for v0.2.0
  • 1980 - Available in the dev version. Planned for v0.2.0
  • 1991 - Available in the dev version. Planned for v0.2.0
  • 2000 - Available in the dev version. Planned for v0.2.0
  • 2010 - Available in the dev version. Planned for v0.2.0

Add merge_households parameter

Add a merge_households (logical) parameter to indicate whether the function should merge household variables into the output data.

  • 1970 population
  • 1980 population
  • 1991 population
  • 2000 population
  • 2000 families
  • 2010 population
  • 2010 emigration
  • 2010 mortality

Add 2000 and 2010 microdata

  • 2000
    • households (ready for v0.2.0)
    • population (ready for v0.2.0)
    • families (ready for v0.2.0)
  • 2010
    • households (ready for v0.2.0)
    • population (ready for v0.2.0)
    • deaths (ready for v0.2.0)
    • emigration (ready for v0.2.0)

2022 Census

Is there any estimate of when the 2022 census will be included in the package?

Problem in the households dictionary

I was using read_households and ran into a problem in the dictionary that changed the result I expected from my code.

library(censobr)
library(dplyr)

# Dictionary used
data_dictionary(year = 2010, dataset = "households", showProgress = TRUE, cache = TRUE)

# Code used
domicilio <- read_households(year = 2010,
                             columns = NULL,
                             add_labels = NULL,
                             as_data_frame = TRUE,
                             showProgress = TRUE,
                             cache = TRUE)

# Sex
# Column V1005 contains 1 - Male, 2 - Female. It should contain the type of
# area of the dwelling (urban, rural, etc.).
domicilio2 <- select(domicilio, code_weighting, V0201, V1005)

nomecolunas <- c("Cod_setor", "Morador_Alugado", "Sexo")
colnames(domicilio2) <- nomecolunas
as.character(domicilio2$Cod_setor)
rm(domicilio2)

Add 1960 Census data

  • microdata - population
  • microdata - households
  • census-tract level data
  • data dictionary
  • questionnaire
  • interview_manual

Add questionnaires

Running the function questionnaire() opens the PDF of a questionnaire in the web browser.

  • 1970 - Available in the dev version. Planned for v0.2.0
  • 1980 - Available in the dev version. Planned for v0.2.0
  • 1991 - Available in the dev version. Planned for v0.2.0
  • 2000 - Available in the dev version. Planned for v0.2.0
  • 2010 - Available in the dev version. Planned for v0.2.0

Add tests to the questionnaire() function, and create vignette

  • tests
  • vignette

Improve code coverage

As of 06/Sept/2023:

censobr Coverage: 28.31%
R/add_labels_emigration.R: 4.91%
R/add_labels_households.R: 14.89%
R/add_labels_population.R: 23.85%
R/add_labels_families.R: 38.24%
R/add_labels_mortality.R: 44.74%
R/read_families.R: 90.48%
R/read_households.R: 90.48%
R/read_population.R: 90.48%
R/censobr_cache.R: 94.44%
R/read_emigration.R: 95.24%
R/read_mortality.R: 95.24%

Add Interview manual

Running the function interview_manual() opens the PDF of an interview manual in the web browser.

  • 1970 - Available in the dev version. Planned for v0.2.0
  • 1980 - Available in the dev version. Planned for v0.2.0
  • 1991 - Available in the dev version. Planned for v0.2.0
  • 2000 - Available in the dev version. Planned for v0.2.0
  • 2010 - Available in the dev version. Planned for v0.2.0

Add tests to the interview_manual() function, and create vignette

  • tests
  • vignette
