ipeagit / censobr

Home Page: https://ipeagit.github.io/censobr/

License: Other

R 100.00%

brazil census census-data microdados microdata r rstats

censobr's Introduction

censobr: Download Data from Brazil's Population Census

censobr is an R package to download data from Brazil's Population Census. The package is built on top of the Arrow platform, which allows users to work with larger-than-memory census data using familiar {dplyr} functions.

Installation

# install from CRAN
install.packages("censobr")

# or use the development version with latest features
utils::remove.packages('censobr')
remotes::install_github("ipeaGIT/censobr", ref="dev")
library(censobr)

Basic usage

The package currently includes 6 main functions to download & read census data:

read_population()
read_households()
read_mortality()
read_families()
read_emigration()
read_tracts()

censobr also includes a few support functions to help users navigate the documentation of Brazilian censuses, providing convenient information on data variables and methodology:

data_dictionary()
questionnaire()
interview_manual()

Finally, the package includes two functions to help users manage the data cached locally:

censobr_cache()
set_censobr_cache_dir()

All censobr functions that read data follow the same logic, so it becomes intuitive to download any data set with a single line of code. Like this:

read_households(
  year,          # year of reference
  columns,       # select columns to read
  add_labels,    # add labels to categorical variables
  as_data_frame, # return an Arrow DataSet or a data.frame
  showProgress,  # show download progress bar
  cache          # cache data for faster access later
  )

Note: all data sets in censobr are enriched with geography columns following the name standards of the {geobr} package to help data manipulation and integration with spatial data from {geobr}. The added columns are: c('code_muni', 'code_state', 'abbrev_state', 'name_state', 'code_region', 'name_region', 'code_weighting').

Data cache

The first time the user runs a function, censobr will download the file and store it locally. This way, the data only needs to be downloaded once. When the cache parameter is set to TRUE (the default), later calls read the cached file instead of downloading it again, which is much faster.

censobr_cache(): can be used to list and/or delete data files cached locally
set_censobr_cache_dir(): can be used to set custom cache directory for censobr files
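
The download-once behaviour described above can be sketched in a few lines of base R. This only illustrates the general pattern and is not censobr's internal code: `read_with_cache()`, `fake_download()` and the file name `demo.rds` are hypothetical names introduced for the example.

```r
# Hypothetical helper: fetch the data once, then reuse the local copy.
# This is NOT how censobr is implemented; it only mimics the idea.
read_with_cache <- function(file_name, fetch, cache_dir = tempdir(), cache = TRUE) {
  local_file <- file.path(cache_dir, file_name)
  if (cache && file.exists(local_file)) {
    message("Reading data cached locally.")
    return(readRDS(local_file))
  }
  data <- fetch()            # the slow step (a download, in real life)
  saveRDS(data, local_file)  # store it for next time
  data
}

# Stand-in for a download that counts how often it is called
n_downloads <- 0
fake_download <- function() {
  n_downloads <<- n_downloads + 1
  data.frame(code_muni = c(3304557, 3550308), pop = c(10, 20))
}

cache_dir <- file.path(tempdir(), "censobr_cache_demo")
dir.create(cache_dir, showWarnings = FALSE)
unlink(file.path(cache_dir, "demo.rds"))  # start from an empty cache

first  <- read_with_cache("demo.rds", fake_download, cache_dir)  # fetches
second <- read_with_cache("demo.rds", fake_download, cache_dir)  # reads the cache
```

After both calls, `n_downloads` is still 1, because the second call never touched the "network". In real use the package does this bookkeeping itself, and `censobr_cache()` is the tool for listing or deleting the stored files.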

Larger-than-memory Data

Microdata of Brazilian censuses are often too big to load into the user's RAM. To avoid this problem, censobr will by default return an Arrow table, which can be analyzed like a regular data.frame using the dplyr package without loading the full data to memory.
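
A minimal sketch of that workflow, using a tiny in-memory Arrow table as a stand-in for what a censobr read function returns, so that it runs without downloading anything. The column names are borrowed from the 2010 population data, but the values are made up, and the {arrow} and {dplyr} packages are assumed to be installed.

```r
library(arrow)
library(dplyr)

# Toy stand-in for census microdata (values are invented)
toy <- data.frame(
  abbrev_state = c("RJ", "RJ", "RJ", "SP", "RJ"),
  V0606        = c(1, 1, 2, 1, 2),      # grouping variable
  V6400        = c(4, 1, 4, 4, 2),      # education level code
  V0010        = c(10, 30, 20, 50, 20)  # sample weight
)
pop <- arrow_table(toy)

# filter() and select() are only recorded here; nothing is computed
# until collect() pulls the (much smaller) result into R's memory
rj <- pop |>
  filter(abbrev_state == "RJ") |>
  select(V0606, V6400, V0010) |>
  collect()

# From here on `rj` is a regular data.frame and plain dplyr applies
res <- rj |>
  group_by(V0606) |>
  summarize(
    higher_edu = sum(V0010[V6400 == 4]) / sum(V0010),
    pop        = sum(V0010)
  )
```

The design point is to push row and column selection to Arrow first, and to call `collect()` only once the data is small enough to fit comfortably in memory.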

More info in the package vignette.

Contributing to censobr

If you would like to contribute to censobr, you're welcome to open an issue to explain the proposed contribution.

Related projects

As far as we know, censobr is the only R package that provides fast and convenient access to data of Brazilian censuses. The microdadosBrasil package used to provide access to microdata of several public data sets, but unfortunately, it has been discontinued.

Similar packages for other countries

Canada: cancensus
Chile: censo2017
US: tidycensus
World: ipumsr

Credits

Original Census data is collected by the Brazilian Institute of Geography and Statistics (IBGE). The censobr package is developed by a team at the Institute for Applied Economic Research (Ipea), Brazil. If you want to cite this package, you can cite it as:

Pereira, Rafael H. M.; Barbosa, Rogério J. (2023) censobr: Download Data from Brazil's Population Census. R package version v0.2.0, https://CRAN.R-project.org/package=censobr.

bibentry(
  bibtype  = "Manual",
  title       = "censobr: Download Data from Brazil's Population Census",
  author      = "Rafael H. M. Pereira [aut, cre] and Rogério J. Barbosa [aut]",
  year        = 2023,
  version     = "v0.2.0",
  url         = "https://CRAN.R-project.org/package=censobr",
  textVersion = "Pereira, R. H. M.; Barbosa, R. J. (2023) censobr: Download Data from Brazil's Population Census. R package version v0.2.0, <https://CRAN.R-project.org/package=censobr>."
)


censobr's Issues

Variables present in the data dictionary

Hello,

I am using the censobr package to load the population data for the year 2000. I noticed that the variable "Sex of the last child born alive" is present in the data dictionary (V0464), but it does not appear in the data actually loaded. Below is the code I used:

library(censobr)

# Load population data for the year 2000
df3 <- read_population(year = 2000,
                      showProgress = FALSE)

# Check whether the variable is present in the data
colnames(df3)

Can you help me, please?

error in macOS: IOException

Error with {duckplyr} and {duckdb} when performing the left joins used in merge_households = TRUE. Issue reported in duckdblabs/duckplyr#203

Missing municipality id in 1991 population table

Hi @rafapereirabr !

I have used the population and household tables from 2010 to 1980. In 1991 I found a problem in the identification code column of the municipality of the observation. I don't know if I'm doing something wrong, but of the 17,045,653 observations, 8,575,800 are missing from this column. I was surprised when I couldn't filter the information for the municipality of Rio de Janeiro. Here's the code I used to test it:

> census1991_2 <- read_population(year = 1991, cache = T)
Reading data cached locally.
> census1991_2 |> 
+   select(code_muni) |> 
+   mutate(test = is.na(code_muni)) |> 
+   count(test) |>
+   collect()
# A tibble: 2 × 2
  test        n
  <lgl>   <int>
1 FALSE 8469853
2 TRUE  8575800

Harmonize variables across censuses

Add 1980 microdata

Add labels to microdata in year 1991

Population dataset
Households dataset

Add 1970 microdata

Add 2022 Census data

Removing data from previous data releases

censobr_cache(delete_file = 'all') should also remove data from old data releases

Add labels to Population microdata in year 2000

using cache_dir and data_release as global variables

New function for Complex Survey Design Object

Persistent error in Github Actions macOS-latest (oldrel)

The package currently passes in every check when tested locally. It also passes the tests in Github Actions in every OS, except for macOS-latest (oldrel). Here's the output of GHA, rather difficult to interpret, tbh.

The error occurs when building the vignette. I've tried removing the vignette entirely, and all checks passed. See this.

Run options(crayon.enabled = TRUE)
── R CMD build ─────────────────────────────────────────────────────────────────
pdflatex not found! Not building PDF manual.
* checking for file ‘.../DESCRIPTION’ ... OK
* preparing ‘censobr’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ...sh: line 1:  5061 Illegal instruction: 4  '/Library/Frameworks/R.framework/Resources/bin/Rscript' --vanilla --default-packages= -e "tools::buildVignettes(dir = '.', tangle = TRUE)" > '/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//RtmpnQbntJ/xshell1395ed07180' 2>&1
 ERROR
--- re-building ‘censobr.Rmd’ using rmarkdown
 *** caught illegal operation ***
address 0x1120d8a63, cause 'illegal opcode'
Traceback:
 1: Table__from_ExecPlanReader(self)
 2: x$read_table()
 3: as_arrow_table.RecordBatchReader(reader)
 4: as_arrow_table(reader)
 5: as_arrow_table.arrow_dplyr_query(x)
 6: as_arrow_table(x)
 7: doTryCatch(return(expr), name, parentenv, handler)
 8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 9: tryCatchList(expr, classes, parentenv, handlers)
10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) {    augment_io_error_msg(e, call, schema = schema())})
11: compute.arrow_dplyr_query(x)
12: collect.arrow_dplyr_query(filter(pop, abbrev_state == "RJ"))
13: collect(filter(pop, abbrev_state == "RJ"))
14: group_by(collect(filter(pop, abbrev_state == "RJ")), V0606)
15: summarize(group_by(collect(filter(pop, abbrev_state == "RJ")),     V0606), higher_edu = sum(V0010[which(V6400 == 4)])/sum(V0010),     pop = sum(V0010))
16: collect(summarize(group_by(collect(filter(pop, abbrev_state ==     "RJ")), V0606), higher_edu = sum(V0010[which(V6400 == 4)])/sum(V0010),     pop = sum(V0010)))
17: eval(expr, envir, enclos)
18: eval(expr, envir, enclos)
19: eval_with_user_handlers(expr, envir, enclos, user_handlers)
20: withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers))
21: withCallingHandlers(withVisible(eval_with_user_handlers(expr,     envir, enclos, user_handlers)), warning = wHandler, error = eHandler,     message = mHandler)
22: handle(ev <- withCallingHandlers(withVisible(eval_with_user_handlers(expr,     envir, enclos, user_handlers)), warning = wHandler, error = eHandler,     message = mHandler))
23: timing_fn(handle(ev <- withCallingHandlers(withVisible(eval_with_user_handlers(expr,     envir, enclos, user_handlers)), warning = wHandler, error = eHandler,     message = mHandler)))
24: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos,     debug = debug, last = i == length(out), use_try = stop_on_error !=         2L, keep_warning = keep_warning, keep_message = keep_message,     log_echo = log_echo, log_warning = log_warning, output_handler = output_handler,     include_timing = include_timing)
25: evaluate::evaluate(...)
26: evaluate(code, envir = env, new_device = FALSE, keep_warning = if (is.numeric(options$warning)) TRUE else options$warning,     keep_message = if (is.numeric(options$message)) TRUE else options$message,     stop_on_error = if (is.numeric(options$error)) options$error else {        if (options$error && options$include)             0L        else 2L    }, output_handler = knit_handlers(options$render, options))
27: in_dir(input_dir(), expr)
28: in_input_dir(evaluate(code, envir = env, new_device = FALSE,     keep_warning = if (is.numeric(options$warning)) TRUE else options$warning,     keep_message = if (is.numeric(options$message)) TRUE else options$message,     stop_on_error = if (is.numeric(options$error)) options$error else {        if (options$error && options$include)             0L        else 2L    }, output_handler = knit_handlers(options$render, options)))
29: eng_r(options)
30: block_exec(params)
31: call_block(x)
32: process_group.block(group)
33: process_group(group)
34: withCallingHandlers(if (tangle) process_tangle(group) else process_group(group),     error = function(e) if (xfun::pkg_available("rlang", "1.0.0") &&         !xfun::check_old_package("learnr", "0.11.3")) rlang::entrace(e))
35: withCallingHandlers(withCallingHandlers(if (tangle) process_tangle(group) else process_group(group),     error = function(e) if (xfun::pkg_available("rlang", "1.0.0") &&         !xfun::check_old_package("learnr", "0.11.3")) rlang::entrace(e)),     error = function(e) {        setwd(wd)        write_utf8(res, output %n% stdout())        message("\nQuitting from lines ", paste(current_lines(i),             collapse = "-"), if (labels[i] != "")             sprintf(" [%s]", labels[i]), sprintf(" (%s)", knit_concord$get("infile")))    })
36: process_file(text, output)
37: knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet)
38: rmarkdown::render(file, encoding = encoding, quiet = quiet, envir = globalenv(),     output_dir = getwd(), ...)
39: vweave_rmarkdown(...)
40: engine$weave(file, quiet = quiet, encoding = enc)
41: doTryCatch(return(expr), name, parentenv, handler)
42: tryCatchOne(expr, names, parentenv, handlers[[1L]])
43: tryCatchList(expr, classes, parentenv, handlers)
44: tryCatch({    engine$weave(file, quiet = quiet, encoding = enc)    setwd(startdir)    output <- find_vignette_product(name, by = "weave", engine = engine)    if (!have.makefile && vignette_is_tex(output)) {        texi2pdf(file = output, clean = FALSE, quiet = quiet)        output <- find_vignette_product(name, by = "texi2pdf",             engine = engine)    }    outputs <- c(outputs, output)}, error = function(e) {    thisOK <<- FALSE    fails <<- c(fails, file)    message(gettextf("Error: processing vignette '%s' failed with diagnostics:\n%s",         file, conditionMessage(e)))})
45: tools::buildVignettes(dir = ".", tangle = TRUE)
An irrecoverable exception occurred. R is aborting now ...
Error: Error in proc$get_built_file() : Build process failed
Calls: <Anonymous> ... build_package -> with_envvar -> force -> <Anonymous>
Execution halted
Error: Process completed with exit code 1.

update files of 2010 census tract data - Goias

IBGE updated the census tract files for the state of Goias in Oct/2023. See this.

Enrich data sets with basic geography columns following geobr name standards

The following columns with basic geography info will be added to data sets. This should help data manipulation and integration with spatial data from the {geobr} package.

code_muni
code_state
abbrev_state
name_state
code_region
name_region
code_weighting
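
As a rough illustration of where some of these columns can come from: IBGE's 7-digit municipality codes start with the 2-digit state code, and the first digit of the state code identifies the region. The base R sketch below derives `code_state`, `code_region` and `abbrev_state` from `code_muni`. It is an assumption about one possible approach, not censobr's actual implementation; the helper name `add_geography_columns()` and the deliberately partial state lookup are made up for the example.

```r
# Hypothetical helper, for illustration only
add_geography_columns <- function(df) {
  # Partial lookup of state codes to abbreviations (only three states here)
  state_lookup <- c("33" = "RJ", "35" = "SP", "52" = "GO")

  df$code_state   <- df$code_muni %/% 100000   # first two digits
  df$code_region  <- df$code_muni %/% 1000000  # first digit
  df$abbrev_state <- unname(state_lookup[as.character(df$code_state)])
  df
}

# Rio de Janeiro, Sao Paulo and Goiania
munis <- data.frame(code_muni = c(3304557, 3550308, 5208707))
out <- add_geography_columns(munis)
```

With columns named this way, the result can be joined to {geobr} spatial data on `code_muni` or `code_state` without renaming anything.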

Add parameter `geometry` to function `read_tracts()`

The plan is to include this parameter after geobr migrates to using geoparquet files. See this

New vignette showing how to work with larger-than-memory data

New function to add labels to variables

Initially, here's the idea: one function per data set.

add_labels_households(arrw, lang = c('PT', 'EN'))
add_labels_population(arrw, lang = c('PT', 'EN'))
add_labels_mortality(arrw, lang = c('PT', 'EN'))

Depending on how it goes, it might be better to have a single function that applies to different data sets. E.g.

add_labels(arrw, dataset = c('households', 'population', 'mortality'), lang = c('PT', 'EN'))

The downside of the first approach is having too many functions and repeated code, because some variables are common to several data sets. The downside of the second approach is that the function would be too big and harder to maintain.
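
One way to soften the "too big" concern of the single-function design is to keep the labels in a lookup table and let the function only dispatch on `dataset` and `lang`. The sketch below assumes a plain data.frame rather than an Arrow object, and the column names (`sex`, `area`) and label tables are hypothetical, not taken from the IBGE dictionaries.

```r
# Hypothetical label tables: one entry per data set, column and language
label_tables <- list(
  population = list(
    sex = list(PT = c("1" = "Masculino", "2" = "Feminino"),
               EN = c("1" = "Male",      "2" = "Female"))
  ),
  households = list(
    area = list(PT = c("1" = "Urbana", "2" = "Rural"),
                EN = c("1" = "Urban",  "2" = "Rural"))
  ),
  mortality = list()  # nothing defined yet in this sketch
)

add_labels <- function(df,
                       dataset = c("households", "population", "mortality"),
                       lang = c("PT", "EN")) {
  dataset <- match.arg(dataset)
  lang    <- match.arg(lang)
  tables  <- label_tables[[dataset]]

  # Only touch columns that exist in the data and have labels defined
  for (col in intersect(names(tables), names(df))) {
    lookup    <- tables[[col]][[lang]]
    df[[col]] <- unname(lookup[as.character(df[[col]])])
  }
  df
}

people    <- data.frame(sex = c(1, 2, 2))
labels_en <- add_labels(people, dataset = "population", lang = "EN")
labels_pt <- add_labels(people, dataset = "population")  # lang defaults to "PT"
```

With this layout, variables shared between data sets could point to the same label vector, which addresses the code-repetition concern without needing one function per data set.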

Add microdata documentation

1970 - Available in the dev version. Planned for v0.2.0
1980 - Available in the dev version. Planned for v0.2.0
1991 - Available in the dev version. Planned for v0.2.0
2000 - Available in the dev version. Planned for v0.2.0
2010 - Available in the dev version. Planned for v0.2.0

add merge_households parameter

Add a logical merge_households parameter to indicate whether the function should merge household variables into the output data.
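
Conceptually, this amounts to a left join of household variables onto person records. A minimal base R sketch, assuming a shared key column named `id_household` (a hypothetical name, as are the function `read_people_sketch()` and the toy columns):

```r
# Hypothetical illustration of what merge_households = TRUE would do
read_people_sketch <- function(people, households, merge_households = FALSE) {
  if (!merge_households) {
    return(people)
  }
  # all.x = TRUE keeps every person, even without a matching household
  merge(people, households, by = "id_household", all.x = TRUE)
}

people     <- data.frame(id_household = c(1, 1, 2, 3), age = c(34, 6, 51, 28))
households <- data.frame(id_household = c(1, 2), rooms = c(4, 2))

merged <- read_people_sketch(people, households, merge_households = TRUE)
```

Every person is kept in `merged`; the person in household 3, which has no household record in this toy data, simply gets `NA` for `rooms`.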

Add 1991 microdata

Add labels to microdata in year 1980

Population dataset
Households dataset

Add labels to microdata in year 1970

Population dataset
Households dataset

Add 2000 and 2010 microdata

v0.2.0 on CRAN

2022 Census

Is there any estimate of when the 2022 census will be included in the package?

Add data dictionary of microdata in year 1991

Add tract-level data of the 2000 census

Source here.

render different pkgdown for dev branch

example

Problem in the households dictionary

I was using read_households and ran into a problem in the dictionary that changed what I intended in my code.

# Dictionary used
data_dictionary(year = 2010, dataset = "households", showProgress = TRUE, cache = TRUE)

# Code used
library(censobr)
library(dplyr)

domicilio <- read_households(year = 2010,
                             columns = NULL,
                             add_labels = NULL,
                             as_data_frame = TRUE,
                             showProgress = TRUE,
                             cache = TRUE)

# sex
# Column V1005 contains 1 - Male, 2 - Female. It should contain the type of area of the dwelling (urban, rural, etc.).
domicilio2 <- select(domicilio, code_weighting, V0201, V1005)

nomecolunas <- c("Cod_setor", "Morador_Alugado", "Sexo")
colnames(domicilio2) <- nomecolunas
as.character(domicilio2$Cod_setor)
rm(domicilio2)

1970 - Available in the dev version. Planned for v0.2.0
1980 - Available in the dev version. Planned for v0.2.0
1991 - Available in the dev version. Planned for v0.2.0
2000 - Available in the dev version. Planned for v0.2.0
2010 - Available in the dev version. Planned for v0.2.0

Add tests to the questionnaire() function, and create vignette

tests
vignette

Improve code coverage

as of 06/Sept/2023

censobr Coverage: 28.31%
R/add_labels_emigration.R: 4.91%
R/add_labels_households.R: 14.89%
R/add_labels_population.R: 23.85%
R/add_labels_families.R: 38.24%
R/add_labels_mortality.R: 44.74%
R/read_families.R: 90.48%
R/read_households.R: 90.48%
R/read_population.R: 90.48%
R/censobr_cache.R: 94.44%
R/read_emigration.R: 95.24%
R/read_mortality.R: 95.24%

Add Interview manual

Running the function interview_manual() will open the PDF of an interview manual in the web browser.

1970 - Available in the dev version. Planned for v0.2.0
1980 - Available in the dev version. Planned for v0.2.0
1991 - Available in the dev version. Planned for v0.2.0
2000 - Available in the dev version. Planned for v0.2.0
2010 - Available in the dev version. Planned for v0.2.0

Add tests to the interview_manual() function, and create vignette

tests
vignette