epistasislab / aliro
Aliro: AI-Driven Data Science
Home Page: https://epistasislab.github.io/Aliro
License: GNU General Public License v3.0
Refactor how datasets are initialized, stored, accessed, and added.
Additional requirements to make the single dataset registration API call not rely on the dataset being in the app filesystem:
ai/metalearning/get_metafeatures.py to use raw data instead of (or as well as) a file path
lab/metafeature.js to use raw data instead of a path
api/v1/datasets does not need to specify a filepath
rebase pennai_lite to be the new master
clean up other branches
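A minimal sketch of what a path-independent `get_metafeatures` could look like, assuming a hypothetical signature that accepts either a file path or an in-memory pandas DataFrame (the function body and metafeature names here are illustrative, not the project's actual implementation):

```python
import pandas as pd

def get_metafeatures(data):
    """Compute simple metafeatures from either a file path or raw data.

    `data` may be a path to a CSV file or a pandas DataFrame, so callers
    no longer need the dataset to exist on the app filesystem.
    """
    if isinstance(data, pd.DataFrame):
        df = data
    else:
        df = pd.read_csv(data)  # fall back to the old path-based behavior
    return {
        "n_rows": int(df.shape[0]),
        "n_columns": int(df.shape[1]),
        "n_missing": int(df.isnull().sum().sum()),
    }
```

Accepting a DataFrame directly would let the registration API pass uploaded data straight through without a filesystem round trip.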
Replace the node scripts in /awsm/*.js which are responsible for rebuilding and starting docker containers with a docker compose file. Should simplify the install process and resource management.
Docker compose files seem to be supported by AWS and Azure as a means of configuring deployment.
recommender methods should be able to query the database directly for info (e.g. whether an ML+P combo has been run on a dataset) rather than accessing static files. @djfunksalot and I will work on this
@djfunksalot Please add detailed documentation about installing PennAI on a local machine.
write unit tests for recommender, including mock database for randomrecommender.
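A sketch of how such a test could mock the database, assuming a hypothetical `RandomRecommender` that takes a database object exposing a `get_algorithms()` method (both names are illustrative stand-ins for the real interfaces):

```python
import random
import unittest
from unittest.mock import MagicMock

class RandomRecommender:
    """Minimal stand-in: picks a random algorithm the database knows about."""
    def __init__(self, db):
        self.db = db

    def recommend(self):
        algorithms = self.db.get_algorithms()
        return random.choice(algorithms)

class TestRandomRecommender(unittest.TestCase):
    def test_recommend_uses_mocked_db(self):
        # Mock the database so the test never touches Mongo.
        db = MagicMock()
        db.get_algorithms.return_value = ["DecisionTreeClassifier",
                                          "LogisticRegression"]
        rec = RandomRecommender(db)
        choice = rec.recommend()
        self.assertIn(choice, db.get_algorithms.return_value)
        db.get_algorithms.assert_called_once()
```

Injecting the mock through the constructor keeps the unit tests fast and independent of a running MongoDB instance.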
node_modules is too large, ~1 GB. We may need some clean-up.
It is not safe, and this issue is related to #35.
Duplicate files are uploaded when an ml experiment completes. Sample of a completed experiment object:
{ _id: '5b5b61cc5fd793003124a761',
_options:
{ criterion: 'gini',
max_depth: 1,
min_samples_split: 2,
min_samples_leaf: 1 },
_dataset_id: '5b5b61c65fd793003124a700',
_project_id: '5b5b618b596f2e0a82562522',
_machine_id: '5b5b61c05fd793003124a67c',
username: 'testuser',
files:
[ { _id: '5b5b61c65fd793003124a711',
filename: 'adult.csv',
mimetype: 'text/csv',
timestamp: 1532715463639 },
{ _id: '5b5b61d45fd793003124a762',
filename: 'model_5b5b61cc5fd793003124a761.pkl',
mimetype: 'application/octet-stream',
timestamp: 1532715476354 },
{ _id: '5b5b61d45fd793003124a763',
filename: 'model_5b5b61cc5fd793003124a761.pkl',
mimetype: 'application/octet-stream',
timestamp: 1532715476365 },
{ _id: '5b5b61d65fd793003124a766',
filename: 'scripts_5b5b61cc5fd793003124a761.py',
mimetype: 'application/octet-stream',
timestamp: 1532715478162 },
{ _id: '5b5b61d65fd793003124a767',
filename: 'scripts_5b5b61cc5fd793003124a761.py',
mimetype: 'application/octet-stream',
timestamp: 1532715478163 },
{ _id: '5b5b61d65fd793003124a76a',
filename: 'imp_score5b5b61cc5fd793003124a761.png',
mimetype: 'image/png',
timestamp: 1532715478612 },
{ _id: '5b5b61d65fd793003124a76b',
filename: 'imp_score5b5b61cc5fd793003124a761.png',
mimetype: 'image/png',
timestamp: 1532715478612 },
{ _id: '5b5b61da5fd793003124a76f',
filename: 'confusion_matrix_5b5b61cc5fd793003124a761.png',
mimetype: 'image/png',
timestamp: 1532715482656 },
{ _id: '5b5b61da5fd793003124a76e',
filename: 'confusion_matrix_5b5b61cc5fd793003124a761.png',
mimetype: 'image/png',
timestamp: 1532715482657 },
{ _id: '5b5b61da5fd793003124a772',
filename: 'roc_curve5b5b61cc5fd793003124a761.png',
mimetype: 'image/png',
timestamp: 1532715482986 },
{ _id: '5b5b61da5fd793003124a773',
filename: 'roc_curve5b5b61cc5fd793003124a761.png',
mimetype: 'image/png',
timestamp: 1532715482988 } ],
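One possible mitigation sketch, assuming the `files` array has the shape shown above: deduplicate entries by filename before (or after) insertion, keeping the earliest upload. This is an illustrative helper, not the project's actual fix:

```python
def dedupe_files(files):
    """Collapse duplicate file entries, keeping the earliest upload per filename.

    `files` is a list of dicts shaped like the experiment object's `files`
    array above (each with `filename` and `timestamp` keys).
    """
    seen = {}
    for f in sorted(files, key=lambda f: f["timestamp"]):
        seen.setdefault(f["filename"], f)  # first (earliest) entry wins
    return list(seen.values())
```

The real fix is presumably to stop the double upload at the source, but a dedupe pass like this would also clean up existing records.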
Currently there are some python unit tests written for the ai recommender and machine.
Create some basic integration tests and some sort of test runner that can run the integration tests and existing unit tests from the command line.
This behavior does not always occur, and it is unclear how to reliably reproduce it.
Example lab log output for a failure that occurred after automatically starting the AI via common.env and then toggling the AI button:
lab_1 | 2018 01:10:33 AM UTC : checking requests...
lab_1 | 2018 01:10:35 AM UTC : checking results...
lab_1 | requesting from : %s http://lab:5080/api/experiments
lab_1 | 2018 01:10:35 AM UTC : checking requests...
lab_1 | 2018 01:10:35 AM UTC : new ai request for: ['adult']
lab_1 | foo
lab_1 | Traceback (most recent call last):
lab_1 | File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
lab_1 | "__main__", mod_spec)
lab_1 | File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
lab_1 | exec(code, run_globals)
lab_1 | File "/appsrc/ai/ai.py", line 460, in <module>
lab_1 | main()
lab_1 | File "/appsrc/ai/ai.py", line 446, in main
lab_1 | if pennai.check_requests():
lab_1 | File "/appsrc/ai/ai.py", line 193, in check_requests
lab_1 | q_utils.startQ(self,r['_id'])
lab_1 | File "/appsrc/ai/q_utils.py", line 33, in startQ
lab_1 | thread = datasetThread(d,self)
lab_1 | File "/appsrc/ai/q_utils.py", line 16, in __init__
lab_1 | self.name = p.user_datasets[threadID]
lab_1 | KeyError: '5b6259eb509e960031bf9fae'
lab_1 | [TAILING] Tailing last 15 lines for [all] processes (change the value with --lines option)
lab_1 | /root/.pm2/pm2.log last 15 lines:
lab_1 | PM2 | [2018-08-02T01:09:24.112Z] PM2 log: PM2 version : 3.0.3
lab_1 | PM2 | [2018-08-02T01:09:24.112Z] PM2 log: Node.js version : 6.12.3
I followed the instructions to build the project via docker-compose. I ran into three issues:
When running docker-compose build, it complains that the version in the yaml file (3) is too high; I had to switch it to 2 to build.
ERROR: compose.cli.errors.log_timeout_error: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
In each experiment, machine/learn/io_utils.py requests the dataset file via the API, saves a copy into a temporary folder, and then creates a symbolic link pointing to the copy. This wastes storage in tmp space when running many experiments. I will remove the symbolic links and the dataset copy in the temporary folder and make io_utils.py return a pandas DataFrame directly in each experiment.
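A sketch of the in-memory approach, with the API call abstracted behind an injected `fetch` callable (the function name and signature here are hypothetical, not the actual io_utils.py interface):

```python
import io
import pandas as pd

def get_input_data(dataset_id, fetch=None):
    """Return the experiment's dataset as a pandas DataFrame.

    Instead of saving a copy under /tmp and symlinking to it, parse the
    API response in memory. `fetch` is any callable that returns the raw
    CSV text for a dataset id (in the real app this would hit the lab API).
    """
    raw_csv = fetch(dataset_id)
    return pd.read_csv(io.StringIO(raw_csv))
```

Parsing the response with `io.StringIO` avoids touching the filesystem entirely, so nothing accumulates in tmp space across experiments.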
For each dataset experiment, export the model to Mongo so that it can be returned as Python code.
The size of the metafeatures needs to be the same as results_data.
Add more unit tests in Machine and move test code to the machine folder.
I suggest MIT but I'm open to other open source licenses. We should do this ASAP.
We need detailed documentation about how to install PennAI on AWS.
Related to #25
Need to set class_weight = 'balanced' for those ML methods in machine.
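For illustration, this is what the setting looks like on two scikit-learn estimators (a generic sketch; the actual list of affected methods lives in machine's project definitions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# class_weight='balanced' reweights samples inversely proportional to
# class frequencies, which matters on imbalanced datasets like Adult.
clf_lr = LogisticRegression(class_weight="balanced", max_iter=1000)
clf_dt = DecisionTreeClassifier(class_weight="balanced")

# Tiny toy example with heavily imbalanced labels.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2], [2, 2]]
y = [0, 0, 0, 0, 0, 1]
clf_dt.fit(X, y)
```

Without the balanced weighting, majority-class accuracy can mask poor minority-class performance on datasets like these.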
In machine/learn/io_utils.py, data about the available algorithms (projects) is being pulled from /lab/examples/Algorithms/* (look for the phrase self.schema = basedir + '/lab/examples/Algorithms/' ...).
Machine should not be referencing data files in /lab, and this information is redundant to what is loaded in the projects folder of the database (initially loaded from /dockers/dbmongo/files/projects.json), which is being used in other parts of the application (for example, to see which algorithms are valid when a machine is registering its algorithms with the lab api). Machine should be getting this data from the lab api.
Might also consider having the projects.json file live in an upper level config folder instead of buried in the docker structure.
Work has been done to consolidate the project into three docker containers in the pennai_lite branch:
This seems like a reasonable server configuration for now. Continue to clean up the deployment process: clean up and simplify Dockerfiles, add error checking to entrypoint.sh/startup.sh where appropriate, clean up dependencies, remove deprecated files, consolidate the entrypoint.sh/start.sh files used by lab, clean up 'dockers/base' to only include what really needs to be shared, etc.
create a base recommender skeleton so that others can write custom recommender algorithms to compare to the baseline algorithm. Transfer the baseline algorithm to an AverageRecommender class.
I'm interested in assembling a rule-based recommender for PennAI. There are a number of themes that I think will ultimately be important to a strong functioning recommender.
*Meta-features of the dataset (e.g. sample size, number of features, etc.) will be important for making early recommendations on a new dataset. At this point we know nothing else about a dataset other than these metafeatures. We don't know if the dataset signal is clean or noisy, simple or complex, univariate or multivariate, etc. Thus early recommendations should weigh most heavily on available meta-feature info on the data.
*The recommender should always start with the (or one of the) simplest, fastest algorithms and parameter settings, essentially assuming that the data pattern may be a simple one. If it turns out not to be simple, then we have some place logical to build from.
*After the first recommendation I think it will be important for the recommender to focus on changes in metric performance as it transitions from one ML to another or one parameter setting to another. The system should both learn from these performance differences, and apply them when deciding what to recommend next, based on observed performance differences it's seen so far in modeling the current dataset of interest.
*I think that one potentially good way to approach this problem is to update a number of evidence categories on a given new dataset for analysis. These categories would be general themes that are understood to be important factors in the success of one ML algorithm or one parameter setting over another. These might be categories like (small or large feature space, noisy or clean problem, simple or complex associations, no missing or missing data, classification or regression, etc). Over the course of making recommendations the algorithm will update evidence regarding where it knows or thinks the dataset lies in each of these categories, and these probabilities will feed into determining the type of machine learner (and parameters) that gets picked next.
*Ultimately there are two kinds of recommendations we want the system to make. The first is a starting-point recommendation (what single run or set of initial runs do we want to test on this new dataset?). The second is, after the first or first few analyses, what is the next best ML or parameters to test? I think these are almost two separate prediction problems that will need to be handled differently by the recommender.
*I think that there should be a mechanism built into the recommender that notices when performance improvements have stagnated despite having made educated recommendations, at which point the recommender switches to a random exploratory recommendation mode, picking an ML or parameter settings that it hasn't tried and for which there may be no, or low, evidence to support its selection. Whether this random approach continues to be used will be based on whether any new improvements to performance are observed following this random choice.
*Regarding a rule-based method, I might look into chained rules, where one rule can activate another. I think this approach might be useful in this context.
This recommender system will compute a score based on the weighted accuracy, run time, and interpretability of each method.
The recommender will identify which method has a score that is significantly better than that of another method. For example, methodA and methodB receive weighted scores that are not significantly different from each other. methodA and methodB are both significantly better than methodC. The ranking for this instance would look like this:
The weights will be parameterized and will add to 1. For example, scores will be updated like so:
new_scores = weightA * accuracy + weightB * runtime + weightC * interpret
and
weightA + weightB + weightC = 1
The accuracy variable will be the traditional accuracy of each method. The runtime variable is also straightforward: the time each method takes to run. The interpret variable represents the interpretability of each method. For example, there will be an objective ranking of the methods by their complexity for the user to understand. There could be a ranking system that tiers methods and gives them a score of 1, 2, or 3. Methods like logistic regression and decision trees are easier to understand and could be assigned a score of 3, whereas methods that are more complex, such as random forest, may be assigned a 1.
As the framework outlines, this would all be computed in the update method. A separate method, _assign_score, would be used to find the value of the interpret variable for each run.
This recommender system will identify which scores are significantly better than other scores. This would be calculated in the recommend method which ranks the methods based on scores. Two or more methods with scores that are not significantly different would be ranked equally.
Finally, a grid search will determine the optimal weights (values of weightA, weightB, and weightC) for the recommender.
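The scoring formula above can be sketched directly (the parameter names and the division of the 1-3 interpretability tier by 3 are illustrative choices, not part of the proposal):

```python
def weighted_score(accuracy, runtime_score, interpret_tier,
                   weight_a=0.6, weight_b=0.2, weight_c=0.2):
    """Combine accuracy, runtime, and interpretability into one score.

    `interpret_tier` is the proposed 1-3 interpretability ranking,
    normalized here to [0, 1]. `runtime_score` is assumed to already be
    scaled so that higher is better. The weights must sum to 1, as the
    proposal requires.
    """
    assert abs(weight_a + weight_b + weight_c - 1.0) < 1e-9, \
        "weights must sum to 1"
    return (weight_a * accuracy
            + weight_b * runtime_score
            + weight_c * (interpret_tier / 3.0))
```

The grid search over weights would then simply sweep (weight_a, weight_b, weight_c) triples on the simplex and compare recommender performance under each setting.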
Re-structure code as a module so it can be distributed as a package.
Organize files by docker container/purpose, perhaps:
Also come up with a cool naming convention/better names for the docker containers/servers than "lab", "machine", "dbmongo".
This should make it easier for the different parts of the project (ai recommender, supported ml algorithms, webserver) to be developed independently, simplify the docker deployment process and make it easier to change/support different platforms (docker, aws, azure, ???) in the future if necessary.
References #28
When machine is pushing files to lab after running, one or more of the .json files that are pushed cannot be parsed correctly. This can be reproduced by running the integration tests (DecisionTreeClassifier on the Adult dataset).
In machine.js, see:
if (path.match(/\.json$/)) {
// Process JSON files
filesP.push(fs.readFile(path, "utf-8").then(sendJSONResults));
Example log output:
0|machine | pushing /appsrc/machine/learn/DecisionTreeClassifier/tmp/5b5b763b53674b00329201cb/roc_curve.json
0|machine | pushing /appsrc/machine/learn/DecisionTreeClassifier/tmp/5b5b763b53674b00329201cb/value.json
0|machine | pushing /appsrc/machine/learn/DecisionTreeClassifier/tmp/5b5b763b53674b00329201cb/prediction_values.json
0|machine | You have triggered an unhandledRejection, you may have forgotten to catch a Promise rejection:
0|machine | StatusCodeError: 500 - "Error: MongoError: Modifiers operate on fields but we found a Array instead. For example: {$mod: {<field>: ...}} not {$set: [ 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, ...
...
...
...
...
0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0
0|machine | , 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1 ]}"
0|machine | at new StatusCodeError (/appsrc/machine/node_modules/request-promise/lib/errors.js:32:15)
0|machine | at Request.RP$callback [as _callback] (/appsrc/machine/node_modules/request-promise/lib/rp.js:77:29)
0|machine | at Request.self.callback (/appsrc/machine/node_modules/request/request.js:185:22)
0|machine | at emitTwo (events.js:106:13)
0|machine | at Request.emit (events.js:191:7)
0|machine | at Request.<anonymous> (/appsrc/machine/node_modules/request/request.js:1157:10)
0|machine | at emitOne (events.js:96:13)
0|machine | at Request.emit (events.js:188:7)
0|machine | at Gunzip.<anonymous> (/appsrc/machine/node_modules/request/request.js:1079:12)
0|machine | at Gunzip.g (events.js:292:16)
0|machine | at emitNone (events.js:91:20)
0|machine | at Gunzip.emit (events.js:185:7)
0|machine | at endReadableNT (_stream_readable.js:974:12)
0|machine | at _combinedTickCallback (internal/process/next_tick.js:80:11)
0|machine | at process._tickDomainCallback (internal/process/next_tick.js:128:9)
Note that StatusCodeError: 500 - "Error: MongoError: Modifiers operate on...
is the error message returned from lab.
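The MongoError suggests the parsed JSON payload (a bare array, as in prediction_values.json) is being handed directly to $set, which only accepts a document of field/value pairs. A sketch of the shape the update document needs, shown in Python for illustration (the `build_update` helper and field name are hypothetical, not the lab codebase's actual fix):

```python
import json

def build_update(field_name, json_text):
    """Build a valid MongoDB $set update from a pushed JSON file's contents.

    $set requires {field: value} pairs; a top-level JSON array (like the
    contents of prediction_values.json) must be wrapped under a named
    field, otherwise Mongo raises "Modifiers operate on fields but we
    found a Array instead", as in the log above.
    """
    value = json.loads(json_text)
    if isinstance(value, list):
        # Wrap the bare array under the file's field name.
        return {"$set": {field_name: value}}
    return {"$set": value}
```

The equivalent fix on the lab side would wrap the parsed array before passing it to the Mongo update call.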
basic idea:
when making a recommendation, do the following:
The basic idea is using a Deep Neural Network to predict the best ML method+parameter setting based on PMLB.
Eventually, I was thinking of using a Wide & Deep neural network for this, but first attempts will make use of other types of deep neural nets.
filter the recommendations so that results already in the database are not re-run.
add sklearn-benchmark5-data.tsv.gz to the database in json format.
Add the ai as a service in Lab so it starts automatically in the lab container.
When starting pennai completely from scratch, the only datasets initially available via the web interface are the initial datasets defined in pennai/lab/examples/Users/Users.json (Adults, Breast Cancer, Gametes, Hypothyroid, Mushrooms, Readmissions). It seems that after some time, if the containers are restarted, all of the datasets become available.
The cause may be that the process that does the initial loading of the datasets takes a while to run, and does so silently.
Part of the docker setup process for all three containers clones pennai:master into /opt/. The docker build process seems to rely on /opt/ (for example the lab Dockerfile runs 'RUN npm install...' from /opt/). The running containers seem to optionally rely on the contents of /opt/ or the local version of pennai as exposed via a docker shared drive.
This seems confusing and error prone, especially if there is a mismatch between local files and github master. Determine if there was a reason for cloning the project (perhaps as part of AWS deployment?). If it makes sense, update the deployment scripts to copy from the local to /opt/ instead of cloning from github. Attempt to copy only what is necessary for the given container (see #34)
Should we implement a recommender that recommends random algo/param settings based on the ones it's been updated with? I sense that this type of recommender could make a good baseline aside from the simple AverageRecommender.
Currently ML algorithm ("projects" by FG nomenclature) ids are being generated randomly whenever a machine container starts up, and the database is not successfully updated with the new ids if it is not empty. This can cause the state of machine to be out of sync with the database, and if there are multiple machines they will have different ids for the same algorithm.
One symptom of this is the inability to run experiments from the UI.
Use static identifiers for machine algorithms, either through a config file or some naming convention.
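One possible naming-convention approach, sketched here as a hypothetical helper (hashing the name is an illustrative scheme, not the project's chosen one): derive the id deterministically from the algorithm's name so every machine container computes the same id across restarts.

```python
import hashlib

def algorithm_id(name):
    """Derive a stable identifier from an algorithm's name.

    Unlike the current randomly generated ids, the same algorithm name
    always yields the same id, so multiple machine containers and a
    pre-populated database stay in sync. Truncated to 24 hex chars to
    match the size of a Mongo ObjectId string.
    """
    return hashlib.sha1(name.encode("utf-8")).hexdigest()[:24]
```

A config file mapping names to fixed ids would work just as well; the key property is that the id no longer changes on container startup.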
The base recommender experiment should simulate the PennAI system starting with a base set of knowledge, recommending ML algos + params, and evaluating how long it takes for the recommender to recommend an ML algo + params that are within some parameterized delta of the best possible score.
@djfunksalot Please make a video demo showing how to install pennai locally and on AWS in a vanilla environment.
The recommender should have a method to store scores in the database for persistence. It probably makes sense to use PyMongo for this.
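A sketch of what the persistence method could look like. It is written against any object with a PyMongo-style `update_one` method (e.g. a `pymongo` Collection); the function name, collection shape, and document layout are all illustrative assumptions:

```python
def save_scores(collection, recommender_id, scores):
    """Persist a recommender's per-algorithm scores to the database.

    `collection` is any object exposing a pymongo-style update_one
    (e.g. db.recommender_scores from a pymongo client). Upserting keeps
    exactly one score document per recommender, so scores survive
    container restarts instead of living only in memory.
    """
    collection.update_one(
        {"_id": recommender_id},
        {"$set": {"scores": scores}},
        upsert=True,
    )
```

On startup the recommender would do the inverse, a `find_one` on its id, to rehydrate its scores.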