samthiriot / gosp.dpp

the direct probabilistic pairing method for generation of synthetic populations

License: GNU General Public License v2.0

R 100.00%

synthetic-population-library synthetic-population r networks network-generator rpackage

gosp.dpp's Introduction

gosp.dpp

Generation of Synthetic Populations: Direct Probabilistic Pairing

user install

From R, you can install it in just two steps:

install the devtools package

install.packages("devtools")

then use it to install the package from github

library(devtools)
install_github("samthiriot/gosp.dpp")

you should also install the optional dependencies:

install.packages(c("ggplot2", "gridExtra", "igraph", "mipfp"))

first steps

To help you get started with the method and package, have a look at the vignettes:

simple example of the generation of a synthetic population made of dwellings and households: http://htmlpreview.github.io/?https://raw.githubusercontent.com/samthiriot/gosp.dpp/master/inst/doc/compose_dwellings_and_households.html

developer install

clone the repository

from inside the clone

install.packages(c("devtools", "rhub", "knitr", "rmarkdown", "roxygen2"))
library(devtools)
devtools::install()
devtools::load_all()

enjoy!

releasing

generate the data

library(devtools)
devtools::install()
source("data-raw/dwellings_households.R")

run local tests

library(devtools)
devtools::test()

check the package locally

library(devtools)
devtools::check(manual=TRUE, vignettes=TRUE)

build the vignettes

build_vignettes()

check on various platforms

before release, we test the package on Windows, MacOSX and Linux

library(rhub)
check()

update comments for CRAN: if relevant, update the comments in cran-comments.md

gosp.dpp's Issues

investigate why pij is not respected in case2

data(dwellings_households)
prepared <- matching.prepare(dwellings_households$sample.A, dwellings_households$sample.B, dwellings_households$pdi, dwellings_households$pdj, dwellings_households$pij)
solved <- matching.solve(prepared, nA=50000, nB=40000, nu.A=1, phi.A=1, delta.A=1, gamma=0, delta.B=1, phi.B=1, nu.B=1)
generated <- matching.generate(solved, dwellings_households$sample.A, dwellings_households$sample.B)
plot(generated, dwellings_households$sample.A$sample, dwellings_households$sample.B$sample)

setup a travis build for this package

see http://r-pkgs.had.co.nz/check.html

fix the plotting of frequencies

it should sum the weights
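
A minimal sketch of the difference between counting rows and summing weights, in base R. The data frame, its column names and its values are made up for illustration and are not taken from the package:

```r
# Hypothetical weighted sample: one row per dwelling, with a survey weight
dwellings <- data.frame(
  surface = c("SURF=1", "SURF=2", "SURF=1", "SURF=3"),
  weight  = c(2.5, 1.0, 0.5, 4.0)
)

# Unweighted counts: every row counts for one
counts_unweighted <- table(dwellings$surface)

# Weighted counts: sum the weights within each modality
counts_weighted <- tapply(dwellings$weight, dwellings$surface, sum)

# Relative frequencies derived from the weighted counts
freq_weighted <- counts_weighted / sum(counts_weighted)
print(freq_weighted)
```

With weighted samples the two tables can differ a lot, so a frequency plot based on row counts would misrepresent the population.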

test the package under windows

add a Freeman-Tukey Goodness of fit measure of errors
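
As a starting point, here is a sketch of the Freeman-Tukey statistic in its common form, FT = 4 * sum((sqrt(O) - sqrt(E))^2), where O are the generated counts and E the expected counts. The function name and the example counts are hypothetical; how this would plug into the package's existing error measures is left open:

```r
# Hypothetical helper, not part of gosp.dpp:
# Freeman-Tukey statistic between observed and expected counts
freeman_tukey <- function(observed, expected) {
  stopifnot(
    length(observed) == length(expected),
    all(observed >= 0),
    all(expected >= 0)
  )
  4 * sum((sqrt(observed) - sqrt(expected))^2)
}

# Made-up counts per cell, including an empty cell
expected  <- c(40, 30, 20, 10, 0)
generated <- c(38, 33, 19, 10, 0)
print(freeman_tukey(generated, expected))
```

Unlike Pearson's chi-squared statistic, this form does not divide by the expected counts, so it stays defined when some cells are empty.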

remove weight column in the result

fix the generation when there is no constraint at all

vignette for case 1

see http://r-pkgs.had.co.nz/vignettes.html

use populations A and B as such, in place

print method for the result of measure.population

share the package on CRAN

make the package quiet

multiple solutions: normalise errors

right now the MSE for degrees, frequencies and nA, nB are not on the same scale,
so it does not make sense to sum them or compare them.
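
One possible normalisation, sketched below, divides the root mean squared error by the mean magnitude of the target, which yields a unitless relative error. This is only an assumption about what "normalise" could mean here; the function name and the numbers are illustrative and not part of the package:

```r
# Hypothetical helper: RMSE divided by the mean magnitude of the target,
# so that errors on sizes and on frequencies become comparable
nrmse <- function(actual, target) {
  scale <- mean(abs(target))
  stopifnot(length(actual) == length(target), scale > 0)
  sqrt(mean((actual - target)^2)) / scale
}

# Made-up errors on population sizes (tens of thousands)...
err_sizes <- nrmse(c(50500, 40000), c(50000, 40000))
# ...and on frequencies (between 0 and 1)
err_freq <- nrmse(c(0.52, 0.48), c(0.50, 0.50))

# Both values are now relative errors and can be summed or compared
print(c(sizes = err_sizes, frequencies = err_freq))
```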

Consider linking one population with itself

split the many plots into individual functions

package as an R package

cleanup useless commented debug code

rename cas1 to case1

vignette for the Lille case

http://r-pkgs.had.co.nz/vignettes.html

ensure the genericity of rectify.degree.counts

fix rounding issues in ndi based on both ni AND ci

restore build on travis

enable resolution with multiple hypotheses

document all the methods

unit test the stretching of min and max di and dj

allow the plotting of stats of a solved case

... before generation

define the best way to define ci from ni/di when di=0

manage the case with di = 0 or dj = 0

better traces on the exploration of several possible solutions

trying to generate billions of individuals when some frequencies are 0

When working on a case with many empty cells in frequency, some parameters lead to the generation of very big populations.

To reproduce the case, using INSEE data:

library(data.table)
library(devtools)
load_all()

dwellings_raw <- read.csv(
		# no backslash before spaces: "\ " is not a valid escape in an R string
		file="~/projets/2017 parcimonious iterated picking/application_lille/FD_LOGEMTZB_2014.txt",
		header=TRUE,
		nrows=50000,
		sep=";",
		check.names=FALSE
		#,
		#col_types = cols(b=col_factor())
		)

# INPER: number of persons in the household
# NBPI: number of rooms in the dwelling
# SURF: surface area of the dwelling

sample_dwellings <- gosp.dpp::create_sample(
                data=dwellings_raw,
                encoding = list(
                        # we provide no mapping
                       ),
                weight.colname="IPONDL"
                )

# free some memory
remove(dwellings_raw)


#CATL: category
# 	1 : Main residences
# 	2 : Occasional dwellings
# 	3 : Secondary residences
# 	4 : Vacant dwellings
# 	Z : Outside ordinary housing
#
# only place a household in it if vacant
pdi <- create_degree_probabilities_table(
                data.frame(
                    'CATL=1'=c(0.0, 1.0),
                    'CATL=2'=c(1.0, 0.0),
                    'CATL=3'=c(1.0, 0.0),
                    'CATL=4'=c(1.0, 0.0),
                    'CATL=Z'=c(1.0, 0.0),
                    check.names=FALSE
                    )
                )


#
households_raw <- read.csv(
		# no backslash before spaces: "\ " is not a valid escape in an R string
		file="~/projets/2017 parcimonious iterated picking/application_lille/FD_INDCVIZB_2014.txt",
		header=TRUE,
		nrows=10000,
		sep=";",
		check.names=FALSE
		#,
		#col_types = cols(b=col_factor())
		)
sample_households <- gosp.dpp::create_sample(
                data=households_raw,
                encoding = list(
                        # we provide no mapping
                       ),
                weight.colname="IPONDI"
                )
remove(households_raw)


pdj <- create_degree_probabilities_table(
                data.frame(
                    'STOCD=00'=c(1.0, 0.000001),
                    'STOCD=10'=c(0.000001, 1.0),
                    'STOCD=21'=c(0.000001, 1.0),
                    'STOCD=22'=c(0.000001, 1.0),
                    'STOCD=23'=c(0.000001, 1.0),
                    'STOCD=30'=c(1.0, 0.000001),
                    'STOCD=ZZ'=c(1.0, 0.000001),
                    check.names=FALSE
                    ),
                norm=TRUE
                )


# STOCD
	# 00 : Unoccupied ordinary dwelling
	# 10 : Owner
	# 21 : Tenant or subtenant of an unfurnished rented dwelling, non-HLM
	# 22 : Tenant or subtenant of an unfurnished rented dwelling, HLM
	# 23 : Tenant or subtenant of a furnished rented dwelling or a hotel room
	# 30 : Housed free of charge
	# ZZ : Outside ordinary housing

# TYPL
	# Type of dwelling
	# 1 : House
	# 2 : Apartment
	# 3 : Residential home (logement-foyer)
	# 4 : Hotel room
	# 5 : Makeshift dwelling
	# 6 : Independent room (with its own entrance)
	# Z : Outside ordinary housing

# INPER: number of persons in the household


# SURF: surface area of the dwelling
	# 6 5 3 4 7 1 2

pij <- create_matching_probabilities_table(
		normalise(
			data.frame(
				"SURF=1"=c(1.0, 1.0, 0.7, 0.4, 0.1, 0.1, 0.001, 0.001, 0.001, 0.001, 0.001, 1.0), 
				"SURF=2"=c(1.0, 1.0, 1.0, 0.7, 0.4, 0.1, 0.1, 0.001, 0.001, 0.001, 0.001, 0.0), 
				"SURF=3"=c(0.8, 1.0, 1.0, 1.0, 0.7, 0.4, 0.1, 0.1, 0.001, 0.001, 0.001, 0.0), 
				"SURF=4"=c(0.3, 0.8, 1.0, 1.0, 1.0, 0.7, 0.4, 0.1, 0.1, 0.001, 0.001, 0.0), 
				"SURF=5"=c(0.3, 0.3, 0.8, 1.0, 1.0, 1.0, 0.7, 0.4, 0.1, 0.1, 0.1, 0.0), 
				"SURF=6"=c(0.1, 0.3, 0.3, 0.8, 1.0, 1.0, 1.0, 0.7, 0.4, 0.1, 0.1, 0.0), 
				"SURF=7"=c(0.01,  0.1, 0.3, 0.3, 0.8, 1.0, 1.0, 1.0, 0.7, 0.4, 0.4, 0.0), 
		        row.names=c("INPER=1", "INPER=2", "INPER=3",  "INPER=4",  "INPER=5",  "INPER=6",  "INPER=7",  "INPER=8",  "INPER=9",  "INPER=10",  "INPER=11", "INPER=Z"), 
		        check.names=FALSE
		        )
			)
		)

prepared <- matching.prepare(sample_dwellings, sample_households, pdi, pdj, pij) 

solved <- matching.solve(prepared, nA=50000, nB=40000, nu.A=1, phi.A=0, delta.A=1, gamma=1, delta.B=1, phi.B=1, nu.B=1, verbose=T)

solved$gen$hat.nB
[1] 7002050000
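
Until the root cause is understood, a sanity check on the solved size could fail early instead of attempting the generation. The sketch below is an assumption about how such a guard might look; the helper name and the max.ratio threshold are hypothetical, and only hat.nB and nB come from the report above:

```r
# Hypothetical guard, not part of gosp.dpp: refuse to generate when the
# solved population size is far larger than the requested one
check_population_size <- function(hat.n, n, max.ratio = 10) {
  ratio <- hat.n / n
  if (!is.finite(ratio) || ratio > max.ratio) {
    stop(sprintf(
      "solved size %.0f is %.0f times the requested size %.0f",
      hat.n, ratio, n
    ))
  }
  invisible(TRUE)
}

# Values from the report: hat.nB = 7002050000 for a requested nB = 40000
result <- try(check_population_size(7002050000, 40000), silent = TRUE)
print(inherits(result, "try-error"))
```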

make the dependency on igraph optional
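
The usual R pattern for an optional dependency is to list the package under Suggests in DESCRIPTION and to guard each use with requireNamespace(). A minimal sketch follows; the function name plot_as_graph and the edge list are made up for illustration and do not exist in the package:

```r
# Hypothetical function: plots pairs as a graph only if igraph is available
plot_as_graph <- function(edges) {
  if (!requireNamespace("igraph", quietly = TRUE)) {
    message("Package 'igraph' is not installed; skipping the graph plot.")
    return(invisible(FALSE))
  }
  # igraph is reached through its namespace, never attached with library()
  g <- igraph::graph_from_data_frame(edges, directed = FALSE)
  plot(g)
  invisible(TRUE)
}

# Made-up pairs between dwellings and households
edges <- data.frame(from = c("d1", "d2"), to = c("h1", "h2"))
plotted <- plot_as_graph(edges)
```

The rest of the package then keeps working when igraph is missing, and only the graph-based plots are skipped.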

create a unit method as an entry point ?

unit tests on cas1: more tests on the actual generation

multiple solutions: create tradeoffs between solutions

typically, if one good solution has hat.nA = nA and another has hat.nB = nB, we might split the error between nA and nB

rename functions according to new vocab

add support for multiple attributes

check why we sometimes miss one or two individuals at generation time

rename matching to pairing ?

replace peering by pairing (?)

test the resizing of populations

unit tests on empty cells

test:

empty cells in fi, di
empty cells due to pdi, pdj
empty cells due to pij

accept frequencies fi and fj as an optional parameter

test case: ensure sample data only contains one line per example !

manage the case with 0 in fi or fj

solving fails on some systems with a "subscript error"

When testing on Debian Linux with R-release, an error is raised:

* checking tests ...
  Running ‘testthat.R’
 ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
         gamma = 0, delta.B = 1, phi.B = 1, nu.B = 1, verbose = FALSE) at testthat/test_basic1.R:389
  2: resolve(sol, case, nA, nB, nu.A, phi.A, delta.A, nu.B, phi.B, delta.B, gamma, verbose = verbose)
  3: resolve.missing.chain(sol.tmp, chain, case, nA, nB, nu.A, phi.A, delta.A, nu.B, phi.B, 
         delta.B, gamma, verbose = verbose)
  
  ══ testthat results  ═══════════════════════════════════════════════════════════
  OK: 632 SKIPPED: 2 FAILED: 5
  1. Error: constraints: nA, phi.A, phi.B (@test_basic1.R#85) 
  2. Error: constraints: phi.A, delta.A (free on matching and B) (@test_basic1.R#167) 
  3. Error: constraints: phi.A, gamma (free on A and B) (@test_basic1.R#188) 
  4. Error: constraints: nothing (totally free - long chain) (@test_basic1.R#305) 
  5. Error: constraints: pdi with zero (p(di=0)=1.0) (@test_basic1.R#389) 
  
  Error: testthat unit tests failed
  Execution halted