liam2 / liam2 Goto Github PK

Framework for developping microsimulation models

License: GNU General Public License v3.0

C 7.54% C++ 3.82% HTML 7.99% Python 76.50% Batchfile 0.46% Makefile 0.11% Cython 3.58%

liam2's Introduction

LIAM2 is a tool to develop (different kinds of) :ref:`microsimulation <microsimulation>` models.

The goal of the project is to let modellers concentrate on what is strictly specific to their model without having to worry about the technical details. This is achieved by providing a generic microsimulation toolbox which is not tied to a particular model. By making it available for free, our hope is to greatly reduce the development costs (in terms of both time and money) of microsimulation models.

The toolbox is made as generic as possible so that it can be used to develop almost any microsimulation model as long as it uses cross-sectional ageing, ie all individuals are simulated at the same time for one period, then for the next period, etc.

License

LIAM2 is licensed under the GNU General Public License (GPL) version 3. This means you can freely use, copy, modify and redistribute this software provided you follow a few conditions. See the license text for details.

Staying informed

You can get notified of new versions and other LIAM2-related announcements by either using our website RSS feed or subscribing to the LIAM2-announce mailing list on Google Groups.

You can do so by entering your email address here:

Credits

The software is primarily being developed at the Federal Planning Bureau (Belgium), with testing and funding by CEPS/INSTEAD (Luxembourg) and IGSS (Luxembourg), and funding from the European Commission. See the :ref:`credits page <credits>` for details.

liam2's People

Contributors

Stargazers

Watchers

Forkers

taxipp-life johnnam antoinearnoud gvk489 slee1009 gdementen jenniebui hong-sung-hyun vishalbelsare gitscotty shivarajmishra

liam2's Issues

improve tutorial/examples

comment each new stuff as it appears the first time
? kill logit_regr without expr (logit_regr(0.0, align='al_p_dead_m.csv'):
use align(uniform(), 'al_p_dead_m.csv') instead?
kill children link, use it or at least add a comment about it
(it is declared but not used in any example)

store columns individually instead of in a table

this should improve compression if any, possibly improve performance, make it possible to only load variables as-they-are needed

allow to load globals in the simulation

ie load globals in their declaration in the simulation file (effectively allowing to bypass the import step for time series and other small csv files)

better error when importing a simulation file or running an import file

use numexpr.E instead of passing strings

The problem is that numexpr.E always creates nodes of kind "double". We could use a more generic NodeMaker:

class NodeMaker(object):
def init(self, kind=None):
self._kind = kind

   def __getattr__(self, name):
       if name.startswith('_'):
           return self.__dict__[name]
       else:
           return ne.expressions.VariableNode(name, self._kind)

INTNODE = NodeMaker('int')

though, I think always using VariableNode directly would be simpler.

chunking for all operationsto to simulate any size of data

alignment will be hard to do right when there is ranking involved because if we simply apply alignment to each chunk:
- the individual paths would potentially be much worse, because in each chunk it would take the x% best scoring individuals but the best in a chunk could very well be the worst of another chunk and thus would not be selected if the whole population was simulated at once.
- the total number would be slightly less precise because we would round to integers in each chunk instead of for the whole population. This can be mitigated/corrected by storing the error in each chunk and adapt the target in the next chunk.
- however, if there is no deterministic ranking, it should be as good as a single block simulation.
- if the ranking score has only a small set of possible values, we could partition individuals using their values, eg with integer:
  partition 1: values 0-10
  partition 2: values 11-20
  etc...
  This should allow to treat/sort chunks separately without having to reconstruct the complete array in memory (see mapreduce sort). However, I don't think this is the case here, so I'm stuck...
- if the number of actually selected individuals is small enough, we could keep the array of selected people by category in memory (along with their score) from chunk to chunk and replace individuals with lower scores by those with higher scores as they are found.
matching is problematic
new is problematic as it breaks the implicit sort by id of the table
- I have to "block" other processes during the new. The new itself can be done progressively as it doesn't modify any variable.
remove seem to be ok

improve behavior on a/0 and 0/0

When there is a divide by zero or 0/0, numpy outputs a RuntimeWarning which seems to confuse users (especially 0/0 which produces a cryptic warning: RuntimeWarning: invalid value encountered in double_scalars). This can happen at least in a groupby() call with percent=True when the total is 0.0, for example if expr = grpcount() and there is no individual matching the filter -- see example in the test simulation.

The situation in release 0.5.1 was even messier because of the code of avglink which disabled the warning in the middle of the simulation and did not restore the behaviour afterwards, which means that if we had an avglink before an error, we did not get any warning, but we did get some if we did not use any avglink or avglink was after the error.

I think the ideal solution here would be to let the user configure what he wants (raise/warn/ignore) but show him the line number in his code, though that will be hard to achieve. Easier to achieve would be to print the expr where it happened. It should be possible to do this by using np.seterrcall:

old_handler = np.seterrcall(my_handler)
old_err = np.seterr(all='call')

and inspect the caller frame at that point (using sys._getframe([depth]).

It would be cleaner to use np.seterr(all='raise') and catch the exception in Expr.evaluate, possibly simply using add_context. But that would only work for the "raise" behaviour which is not what all users want, since divide by 0 and 0/0 can be normal/expected result.

For the "warn" or "message" case, I think the simplest option is to use the caller frame hack. Another option would be to use seterr("raise") at first but in the exception handler (at expr level), use seterr("ignore") and re-evaluate the expression. Hmmm, that would only work once per class of exception (invalid, overflow, ...), not once per exception location, as warnings do (and this is the behaviour we would like).

Yet another option is to call np.seterrcall(my_handler) before any expression is evaluated by numpy. This might be slow though. We would use something like:

    def make_err_handler(expr):
        msg = str(expr)
        def handler(a, b, c): # check what the signature needs to be
            # raise/print/... msg
        return handler

    def evaluate(self):
        myhandler = make_err_handler(self)
        np.seterrcall(myhandler)
        # do real stuff here

make macros available via links

it should be easier now that when the move away from the conditional context is done. I should probably also merge entity .all_symbols and entity.variables?

pure Python mode

implement a // b

as a cleaner and faster equivalent to trunc(a / b)

review field def

do input fields need to be declared at all? I could collect variables from all expressions and check that they are all present in the input file. however, some information about input fields will need to be defined anyway: enumerations, ranges, default values, ... (unless I store all that in the input file metadata)

review the whole filter concept

the missing "else" value is often a problem, but can we avoid it? do we want to avoid it?

add command to console to list macros

prevent running two simulations on the same output file at the same time

I have a report that it causes all sorts of strange problems but I could not reproduce it here (on my very small test case).

make expr.dtype() return (shape, dtype)

(ndim, dtype) might be enough. The immediate goal is to be able to distinguish between arrays and scalars in advance in methods like Variable.getitem and refuse [] on scalars at compile time

add an extra ASTNode pass

the goal is to "compile" expressions to another representation which:

has an uniform interface, so that it can be walked/traversed recursively without having to define a traverse method on most expression classes.
has not its eq method overloaded so that I can use "expr in d" (dict.contains(k) calls k.eq) so that I can store expressions in a dict to check whether they occur more than once

Questions:

use ASTNode class in numexpr as is?
would an uniform interface make things cleaner for current functionalities (mainly as_string, collect_variables and simplify)?

use a different value for missing int in non-link columns

we need to keep -1 for link columns

use proper logging and warnings

use the logging and warnings modules

review align() function/alignment process

I don't want to code anything just yet, I just want to know if the current syntax will need to be modified.

Q: ~~Should it return False for all the persons which are not selected? instead of the current weird situation: False if stored in a temp variable, otherwise don't modify those which are not selected.~~
A: ~~Yes!~~ This has been the case for a looong time now (since release 0.6).
Q: What would the syntax be for multi-value alignment (when there are more than two possible outcomes): align([2, 3, 4], ... ?
A: See al_p_multi_value.csv
Q: So, I have a format in which the user can express the needed proportions and I can easily compute the actual (integer) number of individuals needed for each value. Now, how does the user specify a "hint" of who should get which value? and how do I implement that?
A: One way to do that (and that is what is apparently done for multinomial logistic regressions), is to have the user provide one score expression for each possible outcome value. This assumes all score expressions use the same "scale", otherwise some outcomes will not get the individuals with the highest score for that outcome.

outcome = np.empty(ctx_length)
outcome.fill(missing_value)
for outcome_value, score_values in zip(...):
    sorted_indices = score_values.argsort()
    for idx in sorted_indices:
        if score_values[idx] > other_score[other outcomes]:
            outcome[idx] = outcome_value
            num_taken += 1
        elif score_values[idx] == other_score[other outcomes]:
            # break ties randomly
            outcome[idx] = choice(outcomes[score == score_values[idx]])
            num_taken += 1
        if num_taken == target_num:
            break

    # optimization: delete individuals that were taken for this outcome,
    # so that other outcomes don't have to compare with it

check versions of dependencies on startup

... instead of letting it crash randomly in the middle of a simulation with an error message difficult to understand for users (for example with numexpr 1.4.2)

work around numexpr 2.x limitation of 31 variables

I guess I will need to create hidden temporary variables

full test coverage

use unit tests (this presuppose pure Python mode)

upgrade carray to blz

use more of putmask and put

replace a[bool_array] = x and a[int_array] = x by putmask and put as they seem to be universally (a bit) faster

expose all numpy functionalities

encapsulate indexing stuff in the data provider

possibly use pytables indexing engine

Load only a subset of input columns at the start of each period

If I want to run a simulation but I already know some values (not all of them) for some individual in, says, period 3.

That question can hide two distinct questions.

First, how to load only one variable or any subset ?
Then if these values are loaded when a new period start, are they modified by the processes ?

examples fail (out-of-the-box) on Linux

use "/" in the examples and use os.path.normpath(path)?

groupby on a variable with nans gives awful results

each nan is considered as a different value, which means an awful lot of them

support this syntax: 18 <= age <= 90

we will need to fiddle some more with the AST but it should be possible

aggregates with filter on array with 2+ dimensions

for grpmin, grpmax & grpsum, we could specify "ignore values" for each function and dtype and do values[filter_value] = ignore_value. This might be slower than nanmin/nanmax but at least it would work. For grpstd, it would be trickier but possible (ignore value is the mean) but for median & percentile that method would simply not work.

launch external command during simulation

for example, we could run an external model in another program. That should be easy. However, to be useful, we need to be able to load back the result in the middle of a simulation.

F9 on "import files"

so that you can check what you just imported

implement many2many

transparently create extra array and table with 3 fields:
- period
- side1_id
- side2_id
store a "pointer" to the array into the Link instance
store a "pointer" to the table in ???
adapt all "link" functions
how do we set those relationships?
via actions?
- add_child: append(children, child_id)
- remove_child: remove(children, child_id)
  or via methods on the link (I think I prefer this):
- add_child: children.append(child_id)
- remove_child: children.remove(child_id)

make all arguments to all functions accept expressions

there should be a base class handling this

implement some kind of global namespace

... where we can put anything not tied to an entity (eg, all scalar values, groupby results, ...), so that we can access it from another entity. For example, we could use in household a scalar (or groupby) value stored in a (temporary) global by a person process

attach variables to their entity

this would make "variable checking" (collect_variable) cleaner and more robust. We might need to also attach globals to their "table".

precompile numexpr expressions as soon as possible

ie, add an explicit compilation step. We will need to know the type of each variable in advance. For that to happen, there are a few problems:

in Entity.variables, we create temporary variables without types. We need
the variables to exist to be able to parse the expressions.
We could either:
1. pre-parse expressions (using compile) to know which variables each
  expression depend on, make a topological sort on that and parse
  expressions in the correct order. This way, the types of all variables of
  an expression should be known by the time it is parsed.
2. parse all expressions using untyped temporary variables (like we do
  now), get all variables for each expression (using collect_variables),
  topological sort all expressions which are stored in a temporary
  variable and "type" them one by one.
Option 1 seem cleaner and will probably be faster but I wonder if it
will be able to handle my "conditional context" stuff (not sure option 2
can either).

replace breakpoint(period) by breakpoint(condition)

merge code for loading ndarrays and tables

we should merge importer.load_ndarray and importer.CSV(), they are both special cases of nd record arrays which I would like to support at some point. For that I will need:

support loading nd arrays from csv files formatted like "tables" (one
column per field). The only difference is that the last dimension is
not "expanded" horizontally.
support nd array of "record" arrays (if using the "table" format, this is
only a matter of "folding" the non record "dimension"/columns (ie checking
that all combination of values are present -- because I don't want to
support sparse arrays yet, that they are correctly sorted -- else I should
sort them)
support loading "tables" with the last "dimension" expanded horizontally
this means the "data" in those expanded columns is not named explicitly
(at least with the current format)
see person_salary.csv. This implies that the last column is a "dimension"
(ie, all rows have values for all the possible values for that field --
it can be the "missing" value though)

make an installer

the main goal is to have shortcuts created automatically. Even though cx_freeze can produce basic installers and works fine to produce the executable, their installer is not sufficient in our case since we need a shortcut to notepad, not to the liam2 simulation engine. I guess I will simply use NSIS

make "othertable" indexable by another column than PERIOD

a different "random sequence" for each random "user"

The goal is to prevent divergences of two simulations with the same seed from leaking to unrelated parts of the model through the position in the random sequence. Ideally, it should work even if one model use more sequences than the other in the middle of the simulation. That is, the sequence initial seed for one expression/user should not be based on the position/order of the expression in the simulation.

Here are our options:

dict based on the textual form of the expression?
- for align/logit_score/logit_regr, it would probably work fine,
- for uniform(), not so much because the "expression" is the same. Adding the sequence position in addition, would fix the "collision", but would introduce the "unrelated" changes problem when inserting a random "user" expression and this is exactly what we are trying to avoid in the first place.
optional explicit sequence name will be the most reliable solution but it is also the most verbose and annoying. It would not be that annoying if we only require an additional argument without requiring the user to declare the sequence beforehand.
the best option might be to use the position by default but skip any with an explicit name, so that we can set explicit names only for the "inserted" expressions (which is not present in both variants).

A few examples of random 'users': each alignment with frac_need=uniform, explicit call to a random number generator like uniform(), or logit_score/logit_regr.

propagate missing values for int columns

using floats everywhere should work (even for "id" and "link" columns (it would only limit the number of individuals per entity to 2^24 = ~16 millions for float32 or 2^53 = ~9 * 10^15 for float64)

rewrite the whole import process

load boolean fields as int8 (shouldn't need more work/conversion), then do a n-way merge, then do interpolation. If we haven't done the bool->int8 migration for the simulation code then we will also need to convert boolean fields back to booleans. In that case, we will have to copy the data using: b = i8 == 1, otherwise "missing" (-1) evaluates as True, which is not what we want in most case (and for backward compatibility)

refactor/redesign the clunky parts

the goals are:
- to make things testable by real unittests
- to have better separations of concepts / clearer "scopes" for each
  class
all entities in context
hide id_to_rownum
.array vs .table vs .lagged_array vs Entity vs EntityContext
vs full_context vs mapping passed to numexpr

.array doesn't need to be entirely in memory, .table
does not need to be on disk, but we need separate concepts in some places
because they need to support different functionalities: enlarge, remove,
add column, delete column for .array and only .append for .table

interactive console on failed asserts

as an option, of course

better document csv format

-- for missing
anything else?

check/prevent duplicate ids during import

implement inline dataset loading

I want to be consistent between align & select_link and having to import all alignment files would be a pain

    MIG: load('param\mig.csv', type=float, cached=True)
    othertable: load('param\othertable.csv', fields=[('INTFIELD', int),
                                                     ('FLOATFIELD', float)])
    periodic: load('param\globals.csv', transposed=True, fields='auto')
    periodic: load('param\globals.csv', transposed=True, fields=[
         (CPI, float),
         (FWEMRA, float),
         ...
         (ERA_R, float)])

Note that for periodic globals, I am unsure they need to be able to load them inline but being able to load them at run time instead of at import time would be great.

We need the cached=True option to specify whether the result can be cached or it should be (re)loaded each period: if used to exchange data with another program, it needs to load each period. But for some cases (most notably alignment), the same data can be kept in-memory. In the standard case, we could get away with only checking the modification time of the file. In that case, the user wouldn't need to specify cached=xxx, except in very special cases.

Idea: for "constant" arrays, we need a way to specify which temporary variables should not be dropped at the end of the period so that we can do the load in init. In that case, declaring an explicit dataset would probably be simpler.