udst / synthpop

Synthetic populations from census data

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

synthpop's Issues

return person records

Need to return the appropriate person records as well as the household records (we only return the household records right now).

Should probably also think about if the user wants some other data format than the raw pums records, which is what we return right now. I imagine they would optionally desire a more "sane" output.

parallelization

At some point we should try parallelizing individual geographies, or maybe parallelizing counties

Choose a license for popgen

We should choose a license for popgen. Here are some common ones: http://opensource.org/licenses

UrbanSim is currently covered by the GNU Affero GPL: http://opensource.org/licenses/AGPL-3.0

I was trying out one of the demos (https://github.com/UDST/synthpop/blob/master/demos/simple_synthesis.ipynb) and noticed that the module used there, synthpop.general_synthesizer, doesn't seem to exist in the current version.

Was the module discarded or simply replaced?

Thanks for the help :)

Update README with installation instructions

The README should contain installation instructions for environments created with virtualenv and conda. For virtualenv, they could be as follows:

virtualenv venv --python=python3.7
source venv/bin/activate
pip install -r requirements.txt
cd synthpop/
python setup.py develop
ipython kernel install --user --name=synthpop
Add to /venv/bin/activate the following line: export CENSUS='1234ebcf'
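
Once the CENSUS variable is exported as above, a script or notebook could read the key from the environment instead of hard-coding it. The sketch below is a minimal illustration; the helper name get_census_key is hypothetical and is not part of synthpop.

```python
import os


def get_census_key(env_var="CENSUS"):
    """Return the Census API key stored in the given environment variable.

    The helper name and the error message are illustrative assumptions.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            "Environment variable %s is not set; export it before "
            "starting Python or Jupyter." % env_var)
    return key


# Example usage (requires the variable to be exported first):
# key = get_census_key()
```

Failing early with a clear message makes a missing key easier to spot than an HTTP error later in the run.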

HTTP Error 403: Forbidden

Hi there,
I'm trying to run synthpop but I run into an HTTP Error 403: Forbidden.
My census key is fine.
The Census class defined in census_helpers.py sets the following URL as the base_url:

class Census:

    def __init__(self, key):
        self.c = census.Census(key)
        self.base_url = "https://s3-us-west-2.amazonaws.com/synthpop-data/"

I've tried to access it on a browser, all it says is:
All Access Disabled All access to this object has been disabled

I guess the problem lies there: could you please give me some pointers on how I can solve this issue?
Thank you very much!

URLs

Good morning,

2 URL issues in census_helpers.py:

self.base_url = "http://paris.urbansim.org/data/pums/"
This URL is not working now. It worked weeks ago when I first tried this tool.
self.fips_url = "https://www.census.gov/geo/reference/codes/files/"
"national_county.txt"
This URL is currently "http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt", and it doesn't have column names in the file.

Thanks!
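
Since the file at the new URL has no header row, one way to cope is to supply column names while parsing. The sketch below uses only the standard library; the column names are assumptions chosen for illustration, and the two sample rows only imitate the layout of national_county.txt.

```python
import csv
import io

# Assumed column names; the file itself carries no header row.
FIPS_COLUMNS = ["state", "state_fips", "county_fips", "county_name",
                "class_code"]


def read_fips_rows(text, fieldnames=FIPS_COLUMNS):
    """Parse header-less, comma-separated FIPS text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=fieldnames)
    return list(reader)


# Illustrative rows in the same layout as the downloaded file.
sample = "AL,01,001,Autauga County,H1\nAL,01,003,Baldwin County,H1\n"
rows = read_fips_rows(sample)
print(rows[0]["county_name"])
```

Keeping the codes as strings preserves leading zeros such as "01" and "001".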

Rounding problem for Census BG marginals

Right now, "census_helpers" downloads Tract controls and splits them into Block Group controls when only Tract summaries are available. However, the "_scale_and_merge" function uses astype(int) to convert the final division results, which truncates the values and can lead to unmatched marginal totals. I noticed there is a comment saying "round?" (line 47), but the rounding wasn't implemented. I am wondering why?

Test for State 26, County 125, Tract 165100, BG (1,2,3): hh cars and hh workers only have tract summaries. At the county level, we see thousands fewer households in those 2 categories.

Current method:
category          BG 1  BG 2  BG 3
hh_age_of_head     869   598   277
hh_cars            866   596   275
hh_children        869   598   277
hh_income          869   598   277
hh_race_of_head    869   598   277
hh_size            869   598   277
hh_workers         867   597   275
hispanic_head      869   598   277

Round first, then astype(int) (much better):
category          BG 1  BG 2  BG 3
hh_age_of_head     869   598   277
hh_cars            869   598   277
hh_children        869   598   277
hh_income          869   598   277
hh_race_of_head    869   598   277
hh_size            869   598   277
hh_workers         869   597   277
hispanic_head      869   598   277
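
The effect described above can be reproduced with a few lines of plain Python. This is only a sketch: the tract-level category counts are made-up numbers, the share 869/1744 is taken from the block group totals listed in this issue, and split_counts is a hypothetical stand-in for the scaling step, not the actual _scale_and_merge code.

```python
def split_counts(tract_counts, share, use_round):
    """Scale tract-level category counts down to one block group."""
    scaled = [count * share for count in tract_counts]
    if use_round:
        # Round to the nearest integer before converting.
        return [int(round(value)) for value in scaled]
    # int() truncates toward zero, like astype(int) on positive floats.
    return [int(value) for value in scaled]


tract_counts = [500, 700, 544]   # hypothetical tract totals per category
share = 869 / 1744               # block group share of tract households

truncated = split_counts(tract_counts, share, use_round=False)
rounded = split_counts(tract_counts, share, use_round=True)
print(sum(truncated), sum(rounded))
```

Rounding reduces the loss but does not guarantee that the totals match in every case; an exact match would need an explicit adjustment step after rounding.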

Census instance has no attribute 'tract_to_pums'

Hi,
in the census_api demo, I ran into this error: the Census instance doesn't seem to have the attribute "tract_to_pums".

Thank you very much.

scipy.stats.chisquare throws error in compare_constraints.py due to unequal sum of frequencies

Hi, thanks for the great library. I'm trying to use this to generate some populations and I'm consistently getting the same error, even when I run the examples in this repository.

"ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:" ...

This error is coming from scipy.stats.chisquare which is called by the compare_to_constraints function. It appears that the sum of constraints.values is not always equal to the sum of counts.values (within the required tolerance) and this causes the error.
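
One possible workaround, sketched below under the assumption that only the shape of the distribution matters for the comparison, is to rescale the expected frequencies so that their sum equals the sum of the observed frequencies before calling scipy.stats.chisquare. The helper rescale_expected is hypothetical and not part of synthpop.

```python
def rescale_expected(observed, expected):
    """Scale expected frequencies so their total matches the observed total."""
    total_observed = sum(observed)
    total_expected = sum(expected)
    if total_expected == 0:
        raise ValueError("expected frequencies sum to zero")
    factor = total_observed / total_expected
    return [value * factor for value in expected]


observed = [10, 20, 30]   # illustrative counts from a draw
expected = [12, 18, 33]   # illustrative constraints with a different total
adjusted = rescale_expected(observed, expected)
print(sum(observed), sum(adjusted))
```

The adjusted list could then be passed as f_exp to scipy.stats.chisquare. Whether rescaling is statistically appropriate depends on why the totals differ in the first place.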

Failure with latest pandas version in `ipu.py`

Setting up synthpop in a virtualenv working with python 3.7 was returning the following error:

File "/home/fedec/urbansim/spop/synthpop/synthpop/ipu/ipu.py", line 28, in _drop_zeros
    for (col_idx, (col, nz)) in df.apply(for_each_col, axis=0, raw=True).items():
  File "/home/fedec/urbansim/spop/synthpop/venv/lib/python3.7/site-packages/pandas-1.1.0rc0-py3.7-linux-x86_64.egg/pandas/core/frame.py", line 7541, in apply
    return op.get_result()
  File "/home/fedec/urbansim/spop/synthpop/venv/lib/python3.7/site-packages/pandas-1.1.0rc0-py3.7-linux-x86_64.egg/pandas/core/apply.py", line 178, in get_result
    return self.apply_raw()
  File "/home/fedec/urbansim/spop/synthpop/venv/lib/python3.7/site-packages/pandas-1.1.0rc0-py3.7-linux-x86_64.egg/pandas/core/apply.py", line 219, in apply_raw
    result = np.apply_along_axis(self.f, self.axis, self.values)
  File "<__array_function__ internals>", line 6, in apply_along_axis
  File "/home/fedec/urbansim/spop/synthpop/venv/lib/python3.7/site-packages/numpy-1.19.1-py3.7-linux-x86_64.egg/numpy/lib/shape_base.py", line 402, in apply_along_axis
    buff[ind] = asanyarray(func1d(inarr_view[ind], *args, **kwargs))
ValueError: could not broadcast input array from shape (2,2) into shape (2,0)

When the hh or p columns are unpacked, the result of the _drop_zeros function raises the error above.

I solved it by modifying the pandas version in setup.py, downgrading from pandas==1.1.0rc0 to 1.0.5.

Because the _drop_zeros function is applied to a DataFrame and its results do not all have the same length, unpacking the yielded values raises the error posted above.

Mismatch between `hh` and `p` tables regarding the amount of persons for 2018 results

While allocating households and persons from block groups to blocks, we found differences between the synthetic tables.

To sum up, we often have more persons in the household table than in the persons table. There are also more household_idx values in the first table than in the second.

We already checked consistency between the ACS tracts and the pums2018 files, and every value has a matching puma10 file by state.
At the moment, our strongest hypothesis is that we are losing serial numbers during the synthesis.

Of the cases checked, only ST 05, county 001 has been synthesized correctly.

BUG: ipu not using all constraints

Some constraints given to ipu are ignored. I just noticed that the number of columns in _FrequencyAndConstraints.ncols is not equal to len(person_freq.columns) + len(household_freq.columns). This means that some of the constraints are not being used.

Now trying the 5 whys method:

Why 1?

Because the keys of an OrderedDict must be unique, and the column names of person_freq and household_freq are sequential numbers. So the person constraints overwrite the household constraints.

Why 2?

Why are they sequential numbers when they are not in the tests? In the unit tests, the columns have unique, meaningful names.
Because we set them to sequential numbers. That is how it has been since @fscottfoti added it to git.

And that is as far as I go.

How do we fix?

Do we stop using a dict as the backing for _FrequencyAndConstraints? What consequences does that have? Do we stop changing the index we send? What consequences does that have?

We noticed this while finalizing our synthesized population, so we could use a prompt fix.
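
The key collision described under "Why 1?" can be illustrated without any synthpop code. The sketch below uses made-up column labels and shows one possible remedy, namely prefixing each key with its table of origin; it is not the actual _FrequencyAndConstraints implementation.

```python
from collections import OrderedDict

# Both tables use sequential integers as column names (illustrative data).
household_cols = {0: "hh constraint A", 1: "hh constraint B"}
person_cols = {0: "p constraint A", 1: "p constraint B", 2: "p constraint C"}

# Merging on the raw keys: person entries overwrite household entries.
merged = OrderedDict()
merged.update(household_cols)
merged.update(person_cols)

# Possible remedy: make each key unique by adding the table of origin.
prefixed = OrderedDict()
for key, value in household_cols.items():
    prefixed[("household", key)] = value
for key, value in person_cols.items():
    prefixed[("person", key)] = value

print(len(merged), len(prefixed))
```

With raw keys only three of the five constraints survive, which matches the symptom of ncols being smaller than the combined column count.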

No module named 'synthpop'

Hi, I am new to Python. After running the setup.py install, I have a folder named synthpop.egg-info.
I try import synthpop.zone_synthesizer as zs in Spyder, but it shows
"ModuleNotFoundError: No module named 'synthpop'" in my CMD.
Do I need to do something specific for Anaconda?
I already downloaded it and put it on my C drive, which is the same drive as setup.py.
Thank you.
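
A common cause of this error is that the package was installed into a different Python environment than the one the IDE is running. The following diagnostic sketch, using only the standard library, reports which interpreter is active and whether a top-level module can be found from it; the helper name describe_import is hypothetical.

```python
import importlib.util
import sys


def describe_import(module_name):
    """Report the active interpreter and whether a top-level module is visible."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


# Run this inside the IDE console and inside CMD, then compare the results.
print(describe_import("synthpop"))
```

If the interpreter paths differ between the two consoles, installing synthpop from the environment that the IDE uses is one way to resolve the mismatch.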

full scale test

Should probably synthesize the population of the Bay Area and solve any issues that come up. If it's fast enough we should go for the whole country (why not?).

input data directory?

Does anyone think that this directory is still being used?

https://github.com/synthicity/synthpop/tree/master/demos/input_data

If not we should remove it...

Initial Quality Assessment

I recorded some of the quality data from Napa County, which is pasted below. Low chi-squared is better (ideally less than 1) and high p-value is better. (Each indicating similarity between the expected and observed distributions.) One thing that stands out here is that some block groups turn out pretty well and others don't, and that that's repeatable between runs (it's not random chance). It seems like there's something about those particular block groups that help us end up with a good fit or poor fit that'll require some more investigation. I'm open for ideas on other ways of evaluating the final quality of the synthesis.

Geography: 06 055 201403 2
    num households:  202
    household chisq: 5.43088089045
    household p:     5.21073120377e-34
    people chisq:    16.0817647772
    people p:        9.93781526248e-86
Geography: 06 055 200706 2
    num households:  314
    household chisq: 0.598180266206
    household p:     0.991095451554
    people chisq:    1.5979216509
    people p:        0.0186412607763
Geography: 06 055 201102 2
    num households:  326
    household chisq: 1.55198810451
    household p:     0.00614775663362
    people chisq:    5.09426306818
    people p:        6.05271644872e-19
Geography: 06 055 200202 2
    num households:  151
    household chisq: 1.57294117642
    household p:     0.00488296278056
    people chisq:    9.81533427718
    people p:        1.22522219051e-46
Geography: 06 055 201601 1
    num households:  473
    household chisq: 6.02547998661
    household p:     1.03611171429e-39
    people chisq:    6.29655747411
    people p:        1.01216773952e-25
Geography: 06 055 201401 1
    num households:  341
    household chisq: 3.19792886587
    household p:     4.15912318716e-14
    people chisq:    5.49037386335
    people p:        3.8045168528e-21
Geography: 06 055 200802 1
    num households:  348
    household chisq: 1.61951419488
    household p:     0.00288634250822
    people chisq:    1.82795414481
    people p:        0.00326723157248
Geography: 06 055 201403 1
    num households:  93
    household chisq: 4997.26506248
    household p:     0.0
    people chisq:    13.7349343251
    people p:        6.40236917105e-71
Geography: 06 055 201200 1
    num households:  257
    household chisq: 1.94674866466
    household p:     4.48102332262e-05
    people chisq:    2.03213659632
    people p:        0.000589862312201
Geography: 06 055 200706 3
    num households:  343
    household chisq: 1.24396606265
    household p:     0.109385822386
    people chisq:    1.56084161759
    people p:        0.024157555149
Geography: 06 055 201102 1
    num households:  477
    household chisq: 2.7121000322
    household p:     2.62111652853e-10
    people chisq:    5.16074956589
    people p:        2.59917511571e-19
Geography: 06 055 200504 2
    num households:  1185
    household chisq: 3.39998356272
    household p:     9.15364032494e-16
    people chisq:    10.994263177
    people p:        7.27205255309e-54
Geography: 06 055 200804 2
    num households:  400
    household chisq: 0.81717537662
    household p:     0.826306619365
    people chisq:    0.914239917016
    people p:        0.603490408465
Geography: 06 055 200203 1
    num households:  420
    household chisq: 0.590906858823
    household p:     0.992300753267
    people chisq:    1.20234540617
    people p:        0.202724531156

How to install/run Synthpop

Hi all,

Could anyone please provide some guidance on how to install or run Synthpop? I have recently been assigned a task related to Synthpop and have had a hard time getting it to work. I would really appreciate it.

System is computationally singular when using parametric methods

Hello
When I try to generate synthetic data using the parametric method, I get the following error:
Error in solve.default(xtx + diag(pen)) :
system is computationally singular: reciprocal condition number = 6.01891e-17

I tried dropping features that are highly correlated (correlation coefficient greater than 0.8), but the error still exists. Is there any way to fix this error?

Mismatch between constraints and number of households to draw

I am running into the RuntimeError indicated here:

synthpop/synthpop/draw.py

Lines 72 to 75 in 274df40

    
    raise RuntimeError(
        'There is a mismatch between the constraints and the total '
        'number of households to draw. The total to draw appears '
        'to be higher than indicated by the constraints.')

Can anybody provide insight into the circumstances in which this condition is hit? For reference, I have defined my own recipe of Census fields to pull based on starter2.Starter, and am running into this error in some counties but not all.

double check the zero marginal and zero cell problem

We should probably verify that we're doing the right thing from the paper, and if we need more functionality here.

documentation

Probably goes without saying, but I'll say it.

I've been bad about writing doc strings
We need high level descriptions, especially about the recipes
We need annotated example notebooks
We also need a reasonable readme with credit to the appropriate places

Dramatically Slow and RAM consuming.

Hi,

I have managed to create the seed tables from 1% Census data, which give me around 5600 household seeds and 11700 person seeds. I have tried to generate the population for nearly 3700 Census blocks. The code is very slow in performing the synthesis for all 3700 blocks. Moreover, I had to reduce the number of household and person seeds, because my first extraction was larger than 5600 hh_seeds and 11700 pp_seeds and caused memory issues. Even after the reduction, the code is very RAM-consuming, using up to 25 GB. If needed, I can provide my data for testing.

Best,
Lorenzo Bottaccioli

Memory Error - when executing demos/synthesize.py

Hello,

I am trying to run synthesize.py in the demos for 1 county but the program fails because of a memory error. My computer has 8GB RAM.
Is there a minimum memory requirement?

Below is the stack trace.

synthpop-master/demos$ python synthesize.py "CA" "Santa Clara County"

Synthesizing at geog level: 'block_group' (number of geographies is 1075)
Synthesizing geog id:
 state              06
county            085
tract          500100
block group         1
dtype: object
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 5708, in _reduce
    values = self.values
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 3811, in values
    return self.as_matrix()
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 3790, in as_matrix
    self._consolidate_inplace()
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 3677, in _consolidate_inplace
    self._protect_consolidate(f)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 3666, in _protect_consolidate
    result = f()
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 3675, in f
    self._data = self._data.consolidate()
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3826, in consolidate
    bm._consolidate_inplace()
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3831, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 4853, in _consolidate
    _can_consolidate=_can_consolidate)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 4876, in _merge_blocks
    new_values = new_values[argsort]
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "synthesize.py", line 24, in <module>
    households, people, fit_quality = synthesize_all(starter, indexes=indexes)
  File "/usr/local/lib/python3.5/dist-packages/SynthPop-0.1.dev0-py3.5.egg/synthpop/synthesizer.py", line 142, in synthesize_all
    hh_index_start=hh_index_start)
  File "/usr/local/lib/python3.5/dist-packages/SynthPop-0.1.dev0-py3.5.egg/synthpop/synthesizer.py", line 64, in synthesize
    h_jd.cat_id)
  File "/usr/local/lib/python3.5/dist-packages/SynthPop-0.1.dev0-py3.5.egg/synthpop/categorizer.py", line 116, in frequency_tables
    household_cat_ids)
  File "/usr/local/lib/python3.5/dist-packages/SynthPop-0.1.dev0-py3.5.egg/synthpop/categorizer.py", line 103, in _frequency_table
    assert df.sum().sum() == len(sample_df)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 7295, in stat_func
    numeric_only=numeric_only, min_count=min_count)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 5733, in _reduce
    data = self._get_numeric_data()
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 3745, in _get_numeric_data
    self._data.get_numeric_data()).__finalize__(self)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3587, in get_numeric_data
    self._consolidate_inplace()
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3831, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 4853, in _consolidate
    _can_consolidate=_can_consolidate)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 4873, in _merge_blocks
    new_values = _vstack([b.values for b in blocks], dtype)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 4919, in _vstack
    return np.vstack(to_stack)
  File "/usr/lib/python3/dist-packages/numpy/core/shape_base.py", line 230, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
MemoryError

Thanks for the help

Recipe check utility

In developing my own recipe to synthesize a population using ACS and PUMS, I ran into the below error:

 raise RuntimeError( 
     'There is a mismatch between the constraints and the total ' 
     'number of households to draw. The total to draw appears ' 
     'to be higher than indicated by the constraints.')

The ultimate cause was found to be a typo in my recipe, where one field that was to be mapped between the two datasets had a value typed incorrectly. As a result, the above error arose in some block groups when this particular value arose in the marginal distribution.

I propose developing a utility for checking recipes before they are run through the synthesizer. This could check for:

Field name mapping consistency
Field value mapping consistency
Presence and correct structure of required methods (e.g., get_household_marginal_for_geography).
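
The checks listed above could start out as something like the sketch below. Everything except the method name get_household_marginal_for_geography is an assumption made for illustration: the function check_recipe, its arguments, the second method name, and the sample field values are all hypothetical.

```python
def check_recipe(recipe, required_methods, field_value_map, marginal_values):
    """Return a list of human-readable problems found in a recipe."""
    problems = []
    # Presence of required methods.
    for name in required_methods:
        if not callable(getattr(recipe, name, None)):
            problems.append("missing method: %s" % name)
    # Field name and field value mapping consistency.
    for field, values in marginal_values.items():
        if field not in field_value_map:
            problems.append("field has no mapping: %s" % field)
            continue
        unmapped = sorted(set(values) - set(field_value_map[field]))
        if unmapped:
            problems.append("unmapped values in %s: %s" % (field, unmapped))
    return problems


class DemoRecipe:
    """Hypothetical recipe that implements only one required method."""

    def get_household_marginal_for_geography(self, geography):
        return {}


problems = check_recipe(
    DemoRecipe(),
    required_methods=["get_household_marginal_for_geography",
                      "get_person_marginal_for_geography"],
    field_value_map={"hh_cars": ["none", "one", "two or more"]},
    marginal_values={"hh_cars": ["none", "one", "two or mroe"],
                     "hh_size": ["one"]},
)
for problem in problems:
    print(problem)
```

In this example the misspelled value "two or mroe" is reported before any synthesis runs, which is the kind of typo that caused the RuntimeError described above.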

drawing methods

We probably need an open issue on drawing methods.

Implement the drawing method from the paper
Implement the chi-squared statistic to know if the draw is reasonable
Implement redraw until the chi-squared stat is appropriate

Any other drawing methods? I tend to think doing the one from the paper will be sufficient.

Need to use older versions of pandas

I am trying the notebook demo simple_synthesis and the 3rd cell gives the following error:

RuntimeError Traceback (most recent call last)
in
----> 1 hh_marg, p_marg, hh_sample, p_sample, xwalk = zs.load_data(hh_marginal_file, person_marginal_file, hh_sample_file, person_sample_file)

~\AppData\Local\Continuum\anaconda3\envs\synthpop\lib\site-packages\synthpop-0.1.1-py3.7.egg\synthpop\zone_synthesizer.py in load_data(hh_marginal_file, person_marginal_file, hh_sample_file, person_sample_file)
40
41 hh_marg = pd.read_csv(hh_marginal_file, header=[0, 1], index_col=0)
---> 42 hh_marg.columns.levels[0].name = 'cat_name'
43 hh_marg.columns.levels[1].name = 'cat_values'
44

~\AppData\Local\Continuum\anaconda3\envs\synthpop\lib\site-packages\pandas-1.0.3-py3.7-win-amd64.egg\pandas\core\indexes\base.py in name(self, value)
1189 # Used in MultiIndex.levels to avoid silently ignoring name updates.
1190 raise RuntimeError(
-> 1191 "Cannot set name on a level of a MultiIndex. Use "
1192 "'MultiIndex.set_names' instead."
1193 )

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.

I am guessing that the lines below in the function worked in an older version of pandas:

42: hh_marg.columns.levels[0].name = 'cat_name'
43: hh_marg.columns.levels[1].name = 'cat_values'
49: p_marg.columns.levels[0].name = 'cat_name'
50: p_marg.columns.levels[1].name = 'cat_values'

Removing them seems to make things work with pandas version 1.0.3.
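
Instead of removing the lines, the level names could be set with MultiIndex.set_names, as the error message suggests. The sketch below assumes pandas is installed and uses a small made-up marginals table in place of the CSV loaded by load_data.

```python
import pandas as pd

# Made-up marginals table with two-level columns, standing in for the CSV.
columns = pd.MultiIndex.from_tuples(
    [("cars", "none"), ("cars", "one"), ("workers", "none")])
hh_marg = pd.DataFrame([[10, 20, 5]], index=["zone_1"], columns=columns)

# Name both column levels in one call instead of assigning to levels[i].name.
hh_marg.columns = hh_marg.columns.set_names(["cat_name", "cat_values"])

print(list(hh_marg.columns.names))
```

The same call would apply to p_marg. Keeping the names may matter if later code selects levels by name, which simply deleting the lines would not preserve.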

conda build fails due to ez_setup

When trying to run conda build --python=36 meta.yaml (file here) it throws:

(...)
    raise ImportError
ImportError

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "setup.py", line 2, in <module>
    use_setuptools()
(...)
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: SSL is required
(...)

I think it relates to this issue.
While we wait for the fix, maybe we can do something like this (from the UrbanAccess setup.py):

try:
    import setuptools
except ImportError:
    from ez_setup import use_setuptools

    use_setuptools()

from setuptools import setup, find_packages

infinite loop in ipu

Looks like we're encountering that infinite loop described in the paper. @jiffyclub you ought to be able to recreate it by running this notebook:

http://nbviewer.ipython.org/github/synthicity/synthpop/blob/misc-improvements/demos/synthesize.ipynb

Package naming

Maybe we can rename SynthPop to synthpop or synth-pop here.
Otherwise, when creating the PyPI package, it is named SynthPop 0.1.dev0, which is difficult to find because searching for synthpop returns no results. The same happens with the downloadable package name: SynthPop-0.1.dev0.tar.gz.

Extremely large person errors for some rows using non_census_synthesis

I am getting extremely large errors using the sample data (hh_marginals.csv, household_sample.csv, person_marginals.csv, person_sample.csv) and I'm generating the synthetic population using the non_census_synthesis notebook. The generated households match the marginals very well, but the persons are not matched well at all.

In this picture, I calculate the percent difference between the synthesized and actual marginals. As you can see many of the differences are very large.

I've also tried generating synthesis using my own queried data, and I'm having the same problem with the person distributions not matching well.
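
For reference, the percent difference mentioned above can be computed per category as in the sketch below. The function name, the category labels, and the counts are illustrative assumptions, not values from the sample data.

```python
def percent_difference(synthesized, actual):
    """Percent difference of synthesized counts relative to actual marginals.

    Categories whose actual count is zero map to None, because the
    relative difference is undefined there.
    """
    result = {}
    for category, target in actual.items():
        value = synthesized.get(category, 0)
        if target == 0:
            result[category] = None
        else:
            result[category] = 100.0 * (value - target) / target
    return result


# Illustrative counts only.
diffs = percent_difference({"age_0_17": 110, "age_18_64": 45},
                           {"age_0_17": 100, "age_18_64": 50, "age_65_up": 0})
print(diffs)
```

Small target counts inflate this measure, so a large percentage on a category with only a few persons may be less alarming than it looks.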

Possible biased weighting towards constraint categories with counts <5.

In draw.compare_to_constraints, constraint categories with counts of 0 are ignored, as they would lead to \chi^2 = \infty. Is there an argument for also ignoring all counts <5, since these lead to a generally inflated \chi^2 value? If in draw.draw_households you accept the draw of households that gives the lowest \chi^2, you could be biasing your fit towards constraint categories with counts <5.

I am not sure how big an issue this is. It may only be a problem where there are a number of constraint categories with counts <5.
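
To make the question concrete, the sketch below computes the statistic with a configurable minimum expected count. It is a plain-Python illustration with made-up numbers, not the code in draw.compare_to_constraints; the function name and the min_expected parameter are hypothetical.

```python
def chi_squared(observed, expected, min_expected=0):
    """Sum of (o - e)^2 / e over categories with a usable expected count.

    Categories with an expected count of zero are always skipped, and
    categories with an expected count below min_expected are skipped too.
    """
    return sum((o - e) ** 2 / e
               for o, e in zip(observed, expected)
               if e > 0 and e >= min_expected)


# Illustrative counts: one empty and one very small constraint category.
expected = [0, 2, 50, 48]
observed = [1, 4, 49, 46]

print(chi_squared(observed, expected))                  # zeros ignored only
print(chi_squared(observed, expected, min_expected=5))  # counts <5 ignored
```

In this example the single category with an expected count of 2 contributes most of the statistic, which is the inflation effect raised in the issue.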

	raise RuntimeError(
	'There is a mismatch between the constraints and the total '
	'number of households to draw. The total to draw appears '
	'to be higher than indicated by the constraints.')