The bayeslite from probcomp

expression USING MODELS

SELECT m.modelno
    FROM bayesdb_generator AS g, bayesdb_generator_model AS m
    WHERE g.name = 'foo_cc' AND g.id = m.generator_id
        AND (ESTIMATE MUTUAL INFORMATION OF quagga WITH eland FROM foo_cc
                USING MODEL m.modelno LIMIT 1) > 0.8

notation for predict-when-predictive-probability-is-too-low

We already have

IFNULL(x, PREDICT x WITH CONFIDENCE 0.9)

but it would be nice to have concise notation for

CASE WHEN PREDICTIVE PROBABILITY OF x < 0.1
    THEN PREDICT x WITH CONFIDENCE 0.9
    ELSE x
END

E.g.:

IFIMPREDICTIVE(x, 0.1, PREDICT x WITH CONFIDENCE 0.9)

.describe columns <table with default generator> does not work

I have a table called sat with a default generator satcc. Trying

bayeslite> .describe columns sat

gives

No such generator: sat

.read without argument kills the bayeslite shell

When ".read" is entered without arguments, the bayeslite shell raises an IndexError and terminates. Likely related issue: probcomp/bdbcontrib#1.

To reproduce:

fsaad@fsaad-xps:~/Documents/pcp/crime/src$ bayeslite -m
Welcome to the Bayeslite shell.
Type `.help' for help.
bayeslite> .read
Traceback (most recent call last):
File "/usr/local/bin/bayeslite", line 5, in <module>
pkg_resources.run_script('bayeslite==0.1.dev', 'bayeslite')
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 528, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1401, in run_script
exec(script_code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/EGG-INFO/scripts/bayeslite", line 22, in <module>

File "build/bdist.linux-x86_64/egg/bayeslite/shell/main.py", line 90, in main
File "build/bdist.linux-x86_64/egg/bayeslite/shell/main.py", line 84, in run
File "build/bdist.linux-x86_64/egg/bayeslite/shell/core.py", line 91, in cmdloop
File "/usr/lib/python2.7/cmd.py", line 142, in cmdloop
stop = self.onecmd(line)
File "/usr/lib/python2.7/cmd.py", line 221, in onecmd
return func(arg)
File "build/bdist.linux-x86_64/egg/bayeslite/shell/core.py", line 184, in dot_read
IndexError: list index out of range
fsaad@fsaad-xps:~/Documents/pcp/crime/src$ < -- bayeslite exits here -- >
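
The traceback ends in an IndexError inside dot_read, which suggests that the handler indexes into its argument list without first checking that the list is non-empty. A minimal sketch of a defensive handler follows, using only the standard-library cmd and shlex modules. The class name DemoShell, the dispatch in default, and the usage message are all hypothetical; this is not bayeslite's actual shell code.

```python
import cmd
import io
import shlex


class DemoShell(cmd.Cmd):
    """Hypothetical stand-in for the bayeslite shell; not its real code."""

    prompt = 'demo> '

    def default(self, line):
        # Route dot-commands such as ".read foo.bql" to dot_<name> methods.
        if line.startswith('.'):
            name, _, arg = line[1:].partition(' ')
            handler = getattr(self, 'dot_' + name, None)
            if handler is not None:
                return handler(arg)
        self.stdout.write('Unknown command: %s\n' % (line,))

    def dot_read(self, arg):
        # Validate the argument list before indexing into it, so that a
        # bare ".read" prints a usage message instead of raising IndexError.
        args = shlex.split(arg)
        if len(args) != 1:
            self.stdout.write('Usage: .read <path/to/file>\n')
            return
        self.stdout.write('would read %s\n' % (args[0],))


out = io.StringIO()
shell = DemoShell(stdout=out)
shell.onecmd('.read')          # no argument: usage message, shell survives
shell.onecmd('.read foo.bql')  # one argument: handled normally
print(out.getvalue())
```

With a check like this, a missing argument produces a usage line and the command loop keeps running.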

Crosscat chokes if you insert rows into non-subsampled tables

Should probably just do away with the new bayesdb_crosscat_subsampled table and consider them all as subsampled, so that insertion into a table won't cause subsequent attempts to use Crosscat to get confused by rows it hasn't been told about.

(Deletion is a separate issue, not yet considered at all.)

full SQL DDL in BQL

rename and expose sqlite3_quote_name

Python multiprocessing ignores ^C altogether

This means the bayeslite shell is not responsive to interruption during analysis and other Crosscat computations when you don't pass `-j 1'.

This is different from #25 because it is specifically about the multiprocessing module ignoring ^C so that nothing happens, not about data structures potentially getting corrupted on ^C.

import SQL UNION into BQL

automatic tests for schema upgrades

checkpointing is too slow?

vkm reports that checkpointing is too slow. Investigate and fix.

randomly choose subsamples

Subsampling is currently always done on the first n rows of the table. The selection should be (pseudo)randomly chosen, perhaps by seeding a PRNG with the number of rows in the table.
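
The suggestion above can be sketched with the standard-library random module. The function name choose_subsample and the use of 0-based row indices are assumptions made for illustration; a real implementation would have to map these indices onto the table's actual row ids.

```python
import random


def choose_subsample(n_rows, n_sample):
    """Return a sorted, reproducible pseudorandom subset of row indices.

    Seeding with the row count, as the text suggests, means that the same
    table size always yields the same subsample.
    """
    if n_sample >= n_rows:
        return list(range(n_rows))
    rng = random.Random(n_rows)   # private PRNG; global state is untouched
    return sorted(rng.sample(range(n_rows), n_sample))


rows = choose_subsample(1000, 10)
print(rows)
```

One consequence of seeding with the row count is that inserting a single row changes the seed and therefore the whole subsample, which may or may not be desirable.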

support renaming columns in a table

Python is not ^C-safe

Quoth the Python 2.7.10 manual, Sec. 17.4 `signal -- Set handlers for asynchronous events', at https://docs.python.org/2/library/signal.html:

There is no way to “block” signals temporarily from critical sections (since this is not supported by all Unix flavors).

Apparently Python is intentionally broken on every platform simply because some unspecified, and probably irrelevant ancient idiotic commercial, Unix failed (in some version) to provide a thirty-year-old POSIX API that everyone else implements.

So ^C may corrupt internal data structures and there's no way around it without calling out to C, which we don't really want to do.
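
A partial, Python-level mitigation is to install a SIGINT handler that only records the interrupt and to raise KeyboardInterrupt after the critical section is done. The sketch below assumes a Unix-like platform and the main thread; the name defer_sigint is hypothetical. It does not block the signal at the OS level, so it does nothing for C code that is interrupted mid-call. (On Python 3.3 and later, signal.pthread_sigmask is available on Unix for actual blocking, but that does not help on Python 2.7.)

```python
import os
import signal
from contextlib import contextmanager


@contextmanager
def defer_sigint():
    """Postpone KeyboardInterrupt until the end of the with-block.

    Must be used from the main thread, where signal.signal is allowed.
    """
    pending = []

    def remember(signum, frame):
        # Record the ^C instead of raising inside the critical section.
        pending.append(signum)

    previous = signal.signal(signal.SIGINT, remember)
    try:
        yield
    finally:
        signal.signal(signal.SIGINT, previous)
    if pending:
        raise KeyboardInterrupt


log = []
try:
    with defer_sigint():
        os.kill(os.getpid(), signal.SIGINT)   # simulate ^C mid-section
        log.append('critical section finished')
except KeyboardInterrupt:
    log.append('interrupt delivered afterwards')
print(log)
```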

IntegrityError: CHECK Constraint Failed

I am receiving an error when running ANALYZE:
IntegrityError: CHECK constraint failed: bayesdb_crosscat_diagnostics

What I am doing is running ANALYZE for 5 iterations, getting metadata, running ANALYZE for 5 more iterations, and so on. Here is the bayesdb_crosscat_diagnostics table. Interestingly, the error is thrown at a different point each time the code is run (this time it happened at iteration 60).

bql = '''
SELECT * FROM {};
'''.format(sqlite3_quote_name('bayesdb_crosscat_diagnostics'))
pprint(bdb.execute(bql))

generator_id | modelno | checkpoint | logscore | num_views | column_crp_alpha | iterations
-------------+---------+------------+----------+-----------+------------------+-----------
1 | 0 | 0 | -216.632227271 | 4 | 1.78570188641 | 5
1 | 0 | 1 | -230.775506274 | 4 | 1.93991037446 | 10
1 | 0 | 2 | -170.147162457 | 3 | 5.69412336752 | 15
1 | 0 | 3 | -243.814644297 | 4 | 2.28942848511 | 20
1 | 0 | 4 | -142.679489762 | 3 | 2.10743589934 | 25
1 | 0 | 5 | -125.942376049 | 3 | 1.39280665365 | 30
1 | 0 | 6 | -2.55849991955 | 2 | 1.39280665365 | 35
1 | 0 | 7 | -110.202015803 | 4 | 2.28942848511 | 40
1 | 0 | 8 | -75.3602733849 | 3 | 1.78570188641 | 45
1 | 0 | 9 | -145.83190593 | 3 | 2.70192007704 | 50
1 | 0 | 10 | -153.820838185 | 3 | 1.93991037446 | 55
1 | 0 | 11 | -0.677931099905 | 2 | 1.08635735294 | 60
1 | 0 | 12 | -287.572541798 | 4 | 6.18585278886 | 65
1 | 0 | 13 | -377.227855117 | 5 | 2.28942848511 | 70
1 | 0 | 14 | -350.035549552 | 5 | 2.93525074276 | 75
1 | 0 | 15 | -351.191314923 | 5 | 4.08823676465 | 80
1 | 0 | 16 | -164.336211274 | 4 | 1.78570188641 | 85
1 | 0 | 17 | -94.3282930816 | 3 | 1.08635735294 | 90
1 | 0 | 18 | -150.357486169 | 3 | 1.28208885399 | 95
1 | 0 | 19 | -48.7706536482 | 3 | 2.10743589934 | 100
1 | 0 | 20 | -135.328798613 | 3 | 2.28942848511 | 105
1 | 0 | 21 | -162.967499709 | 3 | 1.78570188641 | 110
1 | 0 | 22 | -472.121060329 | 5 | 4.8248237785 | 115
1 | 0 | 23 | -329.422583699 | 4 | 1.08635735294 | 120
1 | 0 | 24 | -122.140203063 | 3 | 1.08635735294 | 125
1 | 0 | 25 | -192.997961937 | 4 | 2.70192007704 | 130
1 | 0 | 26 | -198.48325827 | 4 | 2.93525074276 | 135
1 | 0 | 27 | -251.100789673 | 4 | 2.48713746883 | 140
1 | 0 | 28 | -112.121011838 | 3 | 1.78570188641 | 145
1 | 0 | 29 | -570.553730555 | 7 | 6.72004666139 | 150
1 | 0 | 30 | -346.332542774 | 5 | 3.46410161514 | 155
1 | 0 | 31 | -292.580269636 | 4 | 2.10743589934 | 160
1 | 0 | 32 | -263.291469867 | 4 | 4.08823676465 | 165
1 | 0 | 33 | -337.162212569 | 5 | 4.08823676465 | 170
1 | 0 | 34 | -246.205950265 | 4 | 2.10743589934 | 175
1 | 0 | 35 | -8.10034750723 | 2 | 1.28208885399 | 180
1 | 0 | 36 | -459.609534473 | 5 | 3.46410161514 | 185
1 | 0 | 37 | -156.440127524 | 3 | 1.51308574942 | 190
1 | 0 | 38 | -128.375654082 | 3 | 1.78570188641 | 195
1 | 0 | 39 | -133.783995676 | 3 | 1.93991037446 | 200
2 | 0 | 0 | -392.463759514 | 3 | 1.08635735294 | 5
2 | 0 | 1 | -266.343128727 | 4 | 1.18017229829 | 10
2 | 0 | 2 | -151.979768066 | 3 | 1.64375182952 | 15
2 | 0 | 3 | -114.133678948 | 3 | 1.08635735294 | 20
2 | 0 | 4 | -143.167193832 | 3 | 1.51308574942 | 25
2 | 0 | 5 | -266.432319956 | 4 | 1.93991037446 | 30
2 | 0 | 6 | -411.509117512 | 6 | 1.39280665365 | 35
2 | 0 | 7 | -433.672710939 | 6 | 6.18585278886 | 40
2 | 0 | 8 | -242.74349302 | 4 | 4.8248237785 | 45
2 | 0 | 9 | -214.089070013 | 4 | 2.70192007704 | 50
2 | 0 | 10 | -91.0669648272 | 3 | 2.10743589934 | 55
2 | 0 | 11 | -239.484416973 | 4 | 1.08635735294 | 60
2 | 0 | 12 | -210.650209045 | 4 | 1.78570188641 | 65
2 | 0 | 13 | -276.782743637 | 4 | 1.64375182952 | 70
2 | 0 | 14 | -102.346233173 | 3 | 1.18017229829 | 75
2 | 0 | 15 | -80.5544142058 | 3 | 2.70192007704 | 80
2 | 0 | 16 | -90.0498130277 | 3 | 1.0 | 85
2 | 0 | 17 | -106.330405807 | 3 | 1.78570188641 | 90
2 | 0 | 18 | -119.367098513 | 3 | 1.08635735294 | 95
2 | 0 | 19 | -286.046686112 | 4 | 3.18873122712 | 100
2 | 0 | 20 | -141.787008903 | 3 | 1.93991037446 | 105
2 | 0 | 21 | -160.594241456 | 3 | 1.51308574942 | 110
2 | 0 | 22 | -125.921822049 | 3 | 1.93991037446 | 115
2 | 0 | 23 | -179.190501136 | 4 | 1.39280665365 | 120
2 | 0 | 24 | -139.447279305 | 3 | 1.93991037446 | 125
2 | 0 | 25 | -141.280506734 | 3 | 1.39280665365 | 130
2 | 0 | 26 | -225.566772209 | 4 | 1.51308574942 | 135
2 | 0 | 27 | -346.16301414 | 5 | 3.76325226094 | 140
2 | 0 | 28 | -313.220051524 | 5 | 5.69412336752 | 145
2 | 0 | 29 | -387.756695884 | 7 | 12.0 | 150
2 | 0 | 30 | -247.281185006 | 5 | 3.18873122712 | 155
2 | 0 | 31 | -297.860855227 | 5 | 4.08823676465 | 160
2 | 0 | 32 | -85.2586449005 | 3 | 1.78570188641 | 165
2 | 0 | 33 | -159.604461696 | 4 | 2.48713746883 | 170
2 | 0 | 34 | -368.991242153 | 5 | 2.10743589934 | 175
2 | 0 | 35 | -72.360628912 | 3 | 2.48713746883 | 180
2 | 0 | 36 | -76.7555482238 | 3 | 1.18017229829 | 185
2 | 0 | 37 | -213.707345621 | 4 | 1.18017229829 | 190
2 | 0 | 38 | -447.327913107 | 6 | 4.08823676465 | 195
2 | 0 | 39 | -301.325092098 | 5 | 1.08635735294 | 200
3 | 0 | 0 | -552.486469194 | 4 | 1.28208885399 | 5
3 | 0 | 1 | -175.471334837 | 3 | 5.24148278842 | 10
3 | 0 | 2 | -101.678849561 | 3 | 2.10743589934 | 15
3 | 0 | 3 | -305.52631536 | 4 | 2.48713746883 | 20
3 | 0 | 4 | -155.762257367 | 3 | 1.39280665365 | 25
3 | 0 | 5 | -195.78558442 | 4 | 1.93991037446 | 30
3 | 0 | 6 | -95.3502070188 | 3 | 1.08635735294 | 35
3 | 0 | 7 | -145.975815597 | 4 | 1.93991037446 | 40
3 | 0 | 8 | -96.1243975756 | 3 | 1.64375182952 | 45
3 | 0 | 9 | -142.369176692 | 3 | 4.44128606985 | 50
3 | 0 | 10 | -106.100206351 | 3 | 1.64375182952 | 55
3 | 0 | 11 | -166.294146291 | 4 | 1.51308574942 | 60

IntegrityError Traceback (most recent call last)
/home/fsaad/Documents/pcp/bayeslite/experiments/exp_hyperparams.py in <module>()
195 args['n_samples'] = samples
196 args['dataset'] = np.asarray(sdata[0][:samples])
--> 197 result = runner(args)
198 plot(result)

/home/fsaad/Documents/pcp/bayeslite/experiments/exp_hyperparams.py in runner(args)
113 results = {}
114 results['args'] = args
--> 115 results['hypers'] = train_models(args)
116
117 return results

/home/fsaad/Documents/pcp/bayeslite/experiments/exp_hyperparams.py in train_models(args)
90 ANALYZE {} FOR {} ITERATIONS WAIT;
91 '''.format(generator, args['step_size'])
---> 92 bdb.execute(bql)
93
94 generator_id = core.bayesdb_get_generator(bdb, generator)

/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/bayeslite/bayesdb.pyc in execute(self, string, bindings)
147 else:
148 raise ValueError('>1 phrase in string')
--> 149 return bql.execute_phrase(self, phrase, bindings)
150
151 def sql_execute(self, string, bindings=None):

/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/bayeslite/bql.pyc in execute_phrase(bdb, phrase, bindings)
552 max_seconds=phrase.seconds,
553 ckpt_iterations=phrase.ckpt_iterations,
--> 554 ckpt_seconds=phrase.ckpt_seconds)
555 return empty_cursor(bdb)
556

/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/bayeslite/crosscat.pyc in analyze_models(self, bdb, generator_id, modelnos, iterations, max_seconds, ckpt_iterations, ckpt_seconds)
833 'column_crp_alpha':
834 diagnostics['column_crp_alpha'][-1][i],
--> 835 'iterations': theta['iterations'],
836 })
837 if cc_cache is not None:

/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/bayeslite/bayesdb.pyc in sql_execute(self, string, bindings)
164 if self.sql_tracer:
165 self.sql_tracer(string, bindings)
--> 166 return self.sqlite3.execute(string, bindings)
167
168 @contextlib.contextmanager

IntegrityError: CHECK constraint failed: bayesdb_crosscat_diagnostics

autogenerate BQL documentation from grammar.y

make `COLUMNS OF <generator>' a first-class table in BQL

BQL savepoints

document crosscat sqlite3 schema

ESTIMATE IN

If you run

ESTIMATE MUTUAL INFORMATION OF quagga WITH eland FROM foo,

you'll get a one-column table of mutual informations, recomputed once for each row in foo. This is silly. But you can't just write

ESTIMATE MUTUAL INFORMATION OF quagga WITH eland

because there's no table/generator specified.

There should be a way to evaluate a single row-independent BQL function for a generator, say:

ESTIMATE MUTUAL INFORMATION OF quagga WITH eland IN foo

An alternative would be to name the generator in the BQL function, as in

ESTIMATE (MUTUAL INFORMATION OF quagga WITH eland IN foo)
    + (PREDICTIVE PROBABILITY OF serpent WITH basilisk IN bar)

but while that is a more appealing language expression composition, it is likely not what most users want to write; I expect ESTIMATE IN to suffice in practically all cases.
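
The per-row recomputation described above has a direct analogue in plain SQLite, which can be observed with a user-defined function. The sketch below is not BQL: row_independent is a hypothetical stand-in for an expensive estimate that does not depend on the row, and it only counts how often SQLite calls it.

```python
import sqlite3

calls = []


def row_independent():
    # Stand-in for an expensive, row-independent estimate such as
    # mutual information; it just records each invocation.
    calls.append(1)
    return 0.5


conn = sqlite3.connect(':memory:')
conn.create_function('row_independent', 0, row_independent)
conn.execute('CREATE TABLE foo (quagga, eland)')
conn.executemany('INSERT INTO foo VALUES (?, ?)', [(1, 2), (3, 4), (5, 6)])

# With a FROM clause the function is evaluated once per row of foo.
per_row = conn.execute('SELECT row_independent() FROM foo').fetchall()
calls_with_from = len(calls)

# Without a FROM clause it is evaluated exactly once.
del calls[:]
single = conn.execute('SELECT row_independent()').fetchall()
calls_without_from = len(calls)

print(len(per_row), calls_with_from)
print(single, calls_without_from)
```

The second form is what ESTIMATE ... IN would provide for BQL functions that need a generator but no row.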

add time-based, not iteration-based, checkpointing

SIMULATE can't be used in a SELECT to generate multiple output rows from a single input row

Someone might want to draw from a kind of fibre product -- for each model satisfying a certain predicate, simulate a bunch of answers to obtain rows of the form

(<modelno>, <simulated col0>, <simulated col1>, <simulated col2>, ...)

It is tempting to try to write something like

SELECT m.modelno, (SIMULATE col0, col1, col2 FROM t USING MODEL m.modelno LIMIT 10)
    FROM MODELS OF t AS m
    WHERE (ESTIMATE MUTUAL INFORMATION OF col8 WITH col1 IN t USING MODEL m.modelno) > 0.8

but SQL has no concept of generating multiple output rows from a single input row. In that context, only the first column of the first row of the SIMULATE will be relevant, because it is a scalar context.
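
The scalar-context behaviour can be reproduced in plain SQLite. In the sketch below, the tables models and draws are hypothetical stand-ins for the set of models and for pre-generated simulation output. The parenthesized subquery matches several rows per model but contributes only its first row, whereas a join yields one output row per draw.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE models (modelno INTEGER)')
conn.execute('CREATE TABLE draws (modelno INTEGER, col0 REAL)')
conn.executemany('INSERT INTO models VALUES (?)', [(0,), (1,)])
conn.executemany('INSERT INTO draws VALUES (?, ?)',
                 [(0, 0.1), (0, 0.2), (0, 0.3), (1, 1.1), (1, 1.2)])

# Scalar context: only the first row of the subquery result is kept.
scalar = conn.execute('''
    SELECT m.modelno,
           (SELECT d.col0 FROM draws AS d
             WHERE d.modelno = m.modelno
             ORDER BY d.col0)
      FROM models AS m
     ORDER BY m.modelno
''').fetchall()

# Relational context: a join produces every (model, draw) combination.
joined = conn.execute('''
    SELECT m.modelno, d.col0
      FROM models AS m JOIN draws AS d ON (d.modelno = m.modelno)
     ORDER BY m.modelno, d.col0
''').fetchall()

print(scalar)
print(joined)
```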

tutorial with shell and example analysis

Interactive tutorial.
Long-running background analysis.
Crosscat diagnostics.
Document-compile-time assertions in tutorial
- E.g., most weird satellite by lifetime ought to be the ISS.
Suggest numbers fast enough to be responsive for:
- <=10,000 rows
- <=100 columns
- <=10-20 views
- <=100 categories
- <=10 models
- <=200 iterations

rows in BQL functions as variables

SELECT c.name, SIMILARITY OF c TO s
    FROM city AS c, city AS s
    WHERE s.name = c.sister

provide a way to name models

SELECT t.(ESTIMATE COLUMNS ...)

require explicit `--volatile' argument if no bdb file specified

implement DEPENDENT/INDEPENDENT for crosscat generators

CREATE GENERATOR foo_cc FOR foo USING crosscat (
    GUESS(*),
    INDEPENDENT(mumble, frotz),
    DEPENDENT(quagga, eland, caribou)
)

Requires support in Crosscat:

probcomp/crosscat#33

BQL quick reference

index exception when ANALYZING table with one column

I created a CSV file with one column of data (named "c0") and 20 values. When I run ANALYZE, an exception is thrown.

Here is a reproducible example of the procedure:

import bayeslite
import bayeslite.crosscat
import numpy as np
import math
import random
import sys

from crosscat.MultiprocessingEngine import MultiprocessingEngine
from bayeslite.shell.pretty import pp_cursor

def pprint(cursor):
    return pp_cursor(sys.stdout, cursor)

if __name__ == '__main__':
    # create one column of data, save to data.csv, with header c0
    t = 20
    data = np.random.rand(t)
    data = data.reshape(len(data),1)
    np.savetxt('data.csv', data, header='c0', comments='')

    btable = "table{}".format(t)
    generator = "table{}_cc".format(t)

    bdb = bayeslite.bayesdb_open()
    engine = bayeslite.crosscat.CrosscatMetamodel(
        MultiprocessingEngine())
    bayeslite.bayesdb_register_metamodel(bdb, engine)
    bayeslite.bayesdb_read_csv_file(bdb, btable, "data.csv",
                                    header=True, create=True)

    bql = '''
    SELECT * FROM {}
    '''.format(btable)
    c = bdb.execute(bql)
    pprint(c)


    bql = '''
    CREATE GENERATOR {} FOR {}
        USING crosscat (
           c0 NUMERICAL
        );
    '''.format(generator, btable)
    bdb.execute(bql)

    # initialize the models; the exception is thrown by the ANALYZE call below
    bql = '''
    INITIALIZE {} MODELS FOR {};
    '''.format(10, generator)
    bdb.execute(bql)

    bql = '''
    ANALYZE {} for {} ITERATIONS WAIT;
    '''.format(generator, 10)
    bdb.execute(bql)

    bql = '''
    CREATE TEMP TABLE simres AS
        SIMULATE c0 FROM {}
        LIMIT {};
    '''.format(generator, 15)
    bdb.execute(bql)

    bql = 'SELECT * FROM simres;'
    simdata = None
    with bdb.savepoint():
        c = bdb.execute(bql)
        simdata = np.array(c.fetchall())

And here is the stack trace:

In [19]: run one_col.py
             c0
---------------
  0.21819395493
 0.930373567089
 0.725379439808
 0.691447842751
 0.261562572085
 0.948943970262
  0.46605176487
0.0151432877238
 0.441854759811
 0.665655889346
0.0765081395686
 0.447978645136
 0.825578309208
 0.500403070452
 0.658746843184
 0.843358329166
 0.248048357726
  0.79623218477
 0.526216988005
 0.875729646947
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/home/fsaad/Documents/pcp/bayeslite/experiments/one_col.py in <module>()
     55     ANALYZE {} for {} ITERATIONS WAIT;
     56     '''.format(generator, 10)
---> 57     bdb.execute(bql)
     58 
     59     bql = '''

/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/bayeslite/bayesdb.pyc in execute(self, string, bindings)
    149         if more:
    150             raise ValueError('>1 phrase in string')
--> 151         return bql.execute_phrase(self, phrase, bindings)
    152 
    153     def sql_execute(self, string, bindings=None):

/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/bayeslite/bql.pyc in execute_phrase(bdb, phrase, bindings)
    553             max_seconds=phrase.seconds,
    554             ckpt_iterations=phrase.ckpt_iterations,
--> 555             ckpt_seconds=phrase.ckpt_seconds)
    556         return empty_cursor(bdb)
    557 

/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/bayeslite/crosscat.pyc in analyze_models(self, bdb, generator_id, modelnos, iterations, max_seconds, ckpt_iterations, ckpt_seconds)
    588                         X_L=X_L_list,
    589                         X_D=X_D_list,
--> 590                         n_steps=n_steps,
    591                     )
    592                     if iterations is not None:

/usr/local/lib/python2.7/dist-packages/CrossCat-0.1.8-py2.7-linux-x86_64.egg/crosscat/LocalEngine.pyc in analyze(self, M_c, T, X_L, X_D, kernel_list, n_steps, c, r, max_iterations, max_time, do_diagnostics, diagnostics_every_N, ROW_CRP_ALPHA_GRID, COLUMN_CRP_ALPHA_GRID, S_GRID, MU_GRID, N_GRID, do_timing, CT_KERNEL)
    267             diagnostics_dict = munge_diagnostics(diagnostics_dict_list)
    268             if reprocess_diagnostics_func is not None:
--> 269                 diagnostics_dict = reprocess_diagnostics_func(diagnostics_dict)
    270             ret_tuple = ret_tuple + (diagnostics_dict, )
    271         if do_timing:

/usr/local/lib/python2.7/dist-packages/CrossCat-0.1.8-py2.7-linux-x86_64.egg/crosscat/utils/diagnostic_utils.pyc in default_reprocess_diagnostics_func(diagnostics_arr_dict)
     50     # column_paritition_assignments are column, iter, chain
     51     D = column_partition_assignments.shape[0] - 1
---> 52     f_z_statistic_0_1 = column_partition_assignments_to_f_z_statistic(column_partition_assignments, 1, 0)
     53     f_z_statistic_0_D = column_partition_assignments_to_f_z_statistic(column_partition_assignments, D, 0)
     54     diagnostics_arr_dict['f_z[0, 1]'] = f_z_statistic_0_1

/usr/local/lib/python2.7/dist-packages/CrossCat-0.1.8-py2.7-linux-x86_64.egg/crosscat/utils/diagnostic_utils.pyc in column_partition_assignments_to_f_z_statistic(column_partition_assignments, j, i)
     43     iter_column_chain_arr = column_partition_assignments.transpose((1, 0, 2))
     44     helper = lambda column_chain_arr: column_chain_to_ratio(column_chain_arr, j, i)
---> 45     as_list = map(helper, iter_column_chain_arr)
     46     return numpy.array(as_list)[:, numpy.newaxis]
     47 

/usr/local/lib/python2.7/dist-packages/CrossCat-0.1.8-py2.7-linux-x86_64.egg/crosscat/utils/diagnostic_utils.pyc in <lambda>(column_chain_arr)
     42         j, i=0):
     43     iter_column_chain_arr = column_partition_assignments.transpose((1, 0, 2))
---> 44     helper = lambda column_chain_arr: column_chain_to_ratio(column_chain_arr, j, i)
     45     as_list = map(helper, iter_column_chain_arr)
     46     return numpy.array(as_list)[:, numpy.newaxis]

/usr/local/lib/python2.7/dist-packages/CrossCat-0.1.8-py2.7-linux-x86_64.egg/crosscat/utils/diagnostic_utils.pyc in column_chain_to_ratio(column_chain_arr, j, i)
     32 
     33 def column_chain_to_ratio(column_chain_arr, j, i=0):
---> 34     chain_i_j = column_chain_arr[[i, j], :]
     35     is_same = numpy.diff(chain_i_j, axis=0)[0] == 0
     36     n_chains = len(is_same)

IndexError: index 1 is out of bounds for axis 0 with size 1

deletion from modelled tables

Need a Crosscat story for this. Also may need a story about non-contiguous row ids.

progress notification for long queries

There is partial support for a progress hook, but it is not called everywhere it should be. It should be invoked at every iteration of a metamodel function, and perhaps sometimes in the metamodel itself so that long-running metamodel functions (e.g., ANALYZE) do not block application responsiveness.
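
For comparison, the underlying sqlite3 module already offers a hook of this kind at the SQL level: Connection.set_progress_handler invokes a callback every n virtual-machine instructions. The sketch below shows only that SQLite-level mechanism; it is not bayeslite's progress hook, and it does not cover time spent inside a metamodel.

```python
import sqlite3

ticks = []


def on_progress():
    # Called by SQLite roughly every 1000 virtual-machine instructions.
    # Returning a nonzero value would abort the running query.
    ticks.append(1)
    return 0


conn = sqlite3.connect(':memory:')
conn.set_progress_handler(on_progress, 1000)
conn.execute('CREATE TABLE t (x INTEGER)')
conn.executemany('INSERT INTO t VALUES (?)', ((i,) for i in range(500)))

# A query with enough work to trigger the handler: a 500 x 500 cross join.
del ticks[:]
(count,) = conn.execute('SELECT COUNT(*) FROM t AS a, t AS b').fetchone()
print(count, len(ticks) > 0)
```

A responsive application could use such a callback to update a progress display or to check for a pending cancellation request.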

document bayesdb sqlite3 schema

insertion into modelled tables

Need some way in BQL to inform the metamodel of newly inserted rows in the table, after analysis on an initial subsample. We already have bayesdb_insert and bayesdb_insertmany, but they are difficult to use -- you must identify the rows that you inserted, and make sure you call bayesdb_insert(many) before you do any BQL functions on them.

quick'n'dirty subsampling

Initialize Crosscat on single random subset of real data for all models.
For BQL queries on rows not known to Crosscat, use hypotheticals.
- No need to manifest insertion into Crosscat in BQL for now.

Jenkins build OS X traditional dmg with prebuilt self contained .app

common table expressions

WITH t AS (SELECT ...) SELECT ...

Requires a newer version of sqlite3 than is in Ubuntu 14.04.
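
The WITH clause was added in SQLite 3.8.3, so code that wants to use it can check the version of the linked library at run time. A minimal sketch, using a hypothetical city table, with a fallback that inlines the common table expression as a subquery:

```python
import sqlite3

# WITH needs SQLite 3.8.3 or newer; check the linked library's version.
HAVE_CTE = sqlite3.sqlite_version_info >= (3, 8, 3)

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE city (name TEXT, population INTEGER)')
conn.executemany('INSERT INTO city VALUES (?, ?)',
                 [('a', 10), ('b', 200), ('c', 3000)])

if HAVE_CTE:
    query = '''
        WITH big AS (SELECT name FROM city WHERE population > 100)
        SELECT name FROM big ORDER BY name
    '''
else:
    # Fallback for old libraries: inline the CTE as a subquery.
    query = '''
        SELECT name FROM (SELECT name FROM city WHERE population > 100)
        ORDER BY name
    '''
names = [row[0] for row in conn.execute(query)]
print(names)
```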

SIMULATE can't refer to enclosing scope

SELECT (SIMULATE x FROM t USING MODEL m.modelno LIMIT 1)
    FROM bayesdb_generator AS g JOIN bayesdb_generator_model AS m ON (g.id = m.generator_id)
        WHERE (ESTIMATE MUTUAL INFORMATION OF foo WITH bar FROM t USING MODEL m.modelno) > 0.8

doesn't work because the SIMULATE is executed first to pre-generate a temporary table, before any value of m.modelno is determined by executing the enclosing query.

track lexical scope in compiler

The BQL->SQL compiler is currently unable to ascertain that in

CREATE TABLE t(x, y);
CREATE TABLE u(z, w);
SELECT y, w FROM t, u WHERE x = z

the x and y come from t and the z and w come from u. This has various consequences:

- No helpful error messages from us.
- Names are wrapped in "foo" in SQL output in case they have special characters -- but sqlite3 idiocy reinterprets "foo" as a string, like 'foo', when it doesn't make sense in context as a column name, so if you mistype a column name you get a column of the constant strings of your typo.
- INFER can't automatically tag the relevant column names in nested expressions with IFNULL(x, PREDICT x WITH CONFIDENCE c), so instead we reject nested expressions in INFER (without INFER EXPLICIT).

The compiler should be taught to track lexical environments so it can do all these things and more.
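
The double-quote fallback mentioned in the list above can be observed directly with the sqlite3 module. Whether the misspelled name is silently accepted depends on how the SQLite library was compiled, so the sketch below handles both outcomes; the table and column names are taken from the example above, and "xx" is a deliberate typo.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (x, y)')
conn.executemany('INSERT INTO t VALUES (?, ?)', [(1, 2), (3, 4)])

# Correctly spelled: "x" resolves to the column.
good = conn.execute('SELECT "x" FROM t').fetchall()

# Misspelled: depending on how the SQLite library was compiled, "xx" is
# either silently treated as the string literal 'xx' or rejected.
try:
    typo = conn.execute('SELECT "xx" FROM t').fetchall()
except sqlite3.OperationalError as error:
    typo = 'rejected: %s' % (error,)

print(good)
print(typo)
```

A compiler that tracks which names are in scope could reject the typo itself and give a helpful message, independent of how SQLite was built.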

design and implement foreign predictors

Still not clear on what these are other than metamodels with only INITIALIZE/ANALYZE and PREDICT/SIMULATE, not MUTUAL INFORMATION or anything else like that.

Need some illustrative examples to generalize from.

type-check queries

ESTIMATE PAIRWISE ROW SIMILARITY WITH RESPECT TO
        (SELECT * FROM sqlite_master)
    FROM t_cc

makes no sense. The compiler should detect that the inner query is not an ESTIMATE COLUMNS, nor a SELECT on an ESTIMATE COLUMNS, and reject it.

bayeslite should support some kind of n-fold cross validation

`bayeslite :memory:' fails to create a file called `:memory:'

Need to work on quoting issues in passing arguments to sqlite3. Should use an explicit URI here, but I think the Python sqlite3 module might not support it (unlike the more sensible apsw alternative).

proper relational realization of USING MODEL

Design a composable, relational realization of MODELS OF <generator> so that you can do queries something like:

SELECT m.modelno FROM MODELS OF foo_cc AS m
    WHERE (ESTIMATE MUTUAL INFORMATION OF quagga WITH eland
            IN foo_cc USING MODEL m) > 0.8

dumb USING MODELS

ESTIMATE PAIRWISE MUTUAL INFORMATION FROM t_cc USING MODELS 3,4

bayeslite pythenv.sh doesn't always override system-wide bayeslite installation with virtualenv --system-site-packages

check.sh, via pythenv.sh, sets PYTHONPATH so that the bayeslite build directory is first, in an attempt to override any other bayeslite installation on the system, whether in /usr or in a virtualenv or in the caller's PYTHONPATH, when running the tests.

But if bayeslite is installed system-wide with

python setup.py build
python setup.py install

then ./check.sh uses the system-wide one, not the local one.

If bayeslite is installed system-wide with a Debian package, then ./check.sh in a virtualenv with --system-site-packages uses the system-wide one, not the local one. But outside a virtualenv it works fine.

Whisky tango foxtrot, Python?

add kludgey .backup command to bdbcontrib

Run in a subprocess:

echo .backup foo0.bdb | sqlite3 foo.bdb
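
As an alternative to shelling out, Python 3.7 and later expose SQLite's online backup API as sqlite3.Connection.backup. The sketch below assumes such a Python version (bayeslite at the time of this issue ran on Python 2.7, where this method does not exist); the helper name backup_database is hypothetical.

```python
import os
import sqlite3
import tempfile


def backup_database(source_path, target_path):
    """Copy a SQLite database file using the online backup API.

    Requires Python 3.7+, where sqlite3.Connection.backup is available.
    """
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(target_path)
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'foo.bdb')
dst = os.path.join(workdir, 'foo0.bdb')

# Create a small source database to copy.
conn = sqlite3.connect(src)
conn.execute('CREATE TABLE t (x)')
conn.execute('INSERT INTO t VALUES (42)')
conn.commit()
conn.close()

backup_database(src, dst)
check = sqlite3.connect(dst)
copied = check.execute('SELECT x FROM t').fetchall()
check.close()
print(copied)
```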

user/program interface for schema upgrades

Prior to release, automatically upgrading the schema for databases generated by pre-release versions was OK, because everyone is using the most recent version from Git. Once we release, it will be necessary to make upgrading the database schema an explicit action so that when you use Bayeslite 2.0, you don't break the databases of users who are still on Bayeslite 1.3.
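
One way to make upgrades explicit is to record a schema version in the database and refuse to touch an older database unless the caller asks for an upgrade. The sketch below stores the version in SQLite's PRAGMA user_version. The names CURRENT_VERSION, SchemaTooOld, and open_database are hypothetical, and how bayeslite actually records its schema version is not assumed here.

```python
import os
import sqlite3
import tempfile

CURRENT_VERSION = 3   # hypothetical schema version of the running software


class SchemaTooOld(Exception):
    pass


def open_database(path, upgrade=False):
    """Open a database, upgrading its schema only on explicit request."""
    conn = sqlite3.connect(path)
    (version,) = conn.execute('PRAGMA user_version').fetchone()
    if version > CURRENT_VERSION:
        conn.close()
        raise ValueError('database is newer than this software')
    if version < CURRENT_VERSION:
        if not upgrade:
            conn.close()
            raise SchemaTooOld('schema version %d < %d; pass upgrade=True'
                               % (version, CURRENT_VERSION))
        # Real migration steps would go here, one per version.
        conn.execute('PRAGMA user_version = %d' % (CURRENT_VERSION,))
        conn.commit()
    return conn


path = os.path.join(tempfile.mkdtemp(), 'demo.bdb')
try:
    open_database(path)            # fresh file has user_version 0
    refused = False
except SchemaTooOld:
    refused = True
conn = open_database(path, upgrade=True)
(version,) = conn.execute('PRAGMA user_version').fetchone()
conn.close()
print(refused, version)
```

In this sketch a brand-new file counts as version 0 and therefore also needs upgrade=True; a real interface would likely distinguish creating a database from upgrading one.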

initializing a table with just the header, then adding a row, causes failure in model initialization

The code below uses the helper functions pprint and do_query from here.

To reproduce the issue

first create a bayesdb table with just the headers

header = ['State,Murder,Assault,Rape,Population,Income,Illiteracy,LifeExp,HSGrad,Frost,GDP,Minority,LiveAlone,Divorce,Geo\n']
bayeslite.bayesdb_read_csv(bdb, btable, iter(header), header=True, create = True)

confirm that the header has been loaded into the table

bql = """
SELECT * FROM states;
"""
c = do_query(bdb, bql)
pprint(c)

bayeslite > SELECT * FROM states;
State | Murder | Assault | Rape | Population | Income | Illiteracy | LifeExp | HSGrad | Frost | GDP | Minority | LiveAlone | Divorce | Geo
------+--------+---------+------+------------+--------+------------+---------+--------+-------+-----+----------+-----------+---------+----

now create the generators and specify the `stattypes`

# create the generator
bql = """
CREATE GENERATOR states_cc FOR states
    USING crosscat (
        GUESS(*),
        State IGNORE,
        Murder NUMERICAL,
        Assault NUMERICAL,
        Rape NUMERICAL,
        Population NUMERICAL,
        Income NUMERICAL,
        Illiteracy NUMERICAL,
        LifeExp NUMERICAL,
        HSGrad NUMERICAL,
        Frost NUMERICAL,
        GDP NUMERICAL,
        Minority NUMERICAL,
        LiveAlone NUMERICAL,
        Divorce NUMERICAL,
        Geo CATEGORICAL
    );
"""
c = do_query(bdb, bql)

# inspect the generator column stattypes
generator_id = core.bayesdb_get_generator(bdb, generator)
sql = '''
SELECT c.colno AS colno, c.name AS name,
        gc.stattype AS stattype, c.shortname AS shortname
    FROM bayesdb_generator AS g,
        (bayesdb_column AS c LEFT OUTER JOIN
            bayesdb_generator_column AS gc
            USING (colno))
    WHERE g.id = ? AND g.id = gc.generator_id;
'''
c = bdb.sql_execute(sql, (generator_id,))
pprint(c)

colno |       name |    stattype | shortname
------+------------+-------------+----------
    1 |     Murder |   numerical |      None
    2 |    Assault |   numerical |      None
    3 |       Rape |   numerical |      None
    4 | Population |   numerical |      None
    5 |     Income |   numerical |      None
    6 | Illiteracy |   numerical |      None
    7 |    LifeExp |   numerical |      None
    8 |     HSGrad |   numerical |      None
    9 |      Frost |   numerical |      None
   10 |        GDP |   numerical |      None
   11 |   Minority |   numerical |      None
   12 |  LiveAlone |   numerical |      None
   13 |    Divorce |   numerical |      None
   14 |        Geo | categorical |      None

Now add a row

row = ['New Mexico,9.7,285,32.1,1144,3601,2.2,70.32,55.2,120,14.3,9.72,14.91,7.3,West\n']
bayeslite.bayesdb_read_csv(bdb, btable, iter(row), header=False, create = False)

confirm that the row is in the table

bayeslite > SELECT * FROM states;

State | Murder | Assault | Rape | Population | Income | Illiteracy | LifeExp | HSGrad | Frost |  GDP | Minority | LiveAlone | Divorce |  Geo
-----------+--------+---------+------+------------+--------+------------+---------+--------+-------+------+----------+-----------+---------+-----
New Mexico |    9.7 |     285 | 32.1 |       1144 |   3601 |        2.2 |   70.32 |   55.2 |   120 | 14.3 |     9.72 |     14.91 |     7.3 | West


Now do the unthinkable and try to initialize the models, boom!

c = do_query(bdb, 'INITIALIZE 10 MODELS FOR states_cc;')
 INITIALIZE 10 MODELS FOR states_cc;
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/lib/python2.7/dist-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
    202             else:
    203                 filename = fname
--> 204             __builtin__.execfile(filename, *where)

/home/fsaad/Documents/pcp/crime/src/crimes_bdb.py in <module>()
    121 pprint(c)
    122 
--> 123 c = do_query(bdb, 'INITIALIZE 10 MODELS FOR states_cc;')
    124 
    125 # close the file buffer

/home/fsaad/Documents/pcp/crime/src/crimes_bdb.py in do_query(bdb, bql, bindings)
     36         bindings = ()
     37     print '--> ' + bql.lstrip()
---> 38     return bdb.execute(bql, bindings)
     39 
     40 btable = 'states'

/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/bayeslite/bayesdb.pyc in execute(self, string, bindings)
    149         if more:
    150             raise ValueError('>1 phrase in string')
--> 151         return bql.execute_phrase(self, phrase, bindings)
    152 
    153     def sql_execute(self, string, bindings=None):

/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/bayeslite/bql.pyc in execute_phrase(bdb, phrase, bindings)
    525             metamodel = core.bayesdb_generator_metamodel(bdb, generator_id)
    526             metamodel.initialize_models(bdb, generator_id, modelnos,
--> 527                 model_config)
    528         return empty_cursor(bdb)
    529 

/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/bayeslite/crosscat.pyc in initialize_models(self, bdb, generator_id, modelnos, model_config)
    462             M_c=M_c,
    463             M_r=None,           # XXX
--> 464             T=self._crosscat_data(bdb, generator_id, M_c),
    465             n_chains=len(modelnos),
    466             initialization=model_config['initialization'],

/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/bayeslite/crosscat.pyc in _crosscat_data(self, bdb, generator_id, M_c)
    183         return [[crosscat_value_to_code(bdb, generator_id, M_c, colno, value)
    184                 for value, (_name, colno) in zip(row, columns)]
--> 185             for row in cursor]
    186 
    187     def _crosscat_thetas(self, bdb, generator_id, modelno):

/usr/local/lib/python2.7/dist-packages/bayeslite-0.1.dev-py2.7.egg/bayeslite/crosscat.pyc in crosscat_value_to_code(bdb, generator_id, M_c, colno, value)
    991         cc_colno = crosscat_cc_colno(bdb, generator_id, colno)
    992         key = unicode(value)
--> 993         code = M_c['column_metadata'][cc_colno]['code_to_value'][key]
    994         # XXX Crosscat expects floating-point codes.
    995         return float(code)

NB: When we initialize the table with the header AND the first row, this issue does not happen, i.e.

bayeslite.bayesdb_read_csv(bdb, btable, iter(header+row), header=True, create = True)
