epistasislab / aliro
Aliro: AI-Driven Data Science
Home Page: https://epistasislab.github.io/Aliro
License: GNU General Public License v3.0
Refactor how datasets are initialized, stored, accessed, and added.
Additional requirements to make the single dataset registration API call not rely on the dataset being in the app filesystem:
ai/metalearning/get_metafeatures.py to use raw data instead of (or as well as) a file path
lab/metafeature.js to use raw data instead of a path
api/v1/datasets does not need to specify a filepath
rebase pennai_lite to be the new master
clean up other branches
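A minimal sketch of what a path-independent `get_metafeatures` could look like, assuming a hypothetical signature that accepts either a file path or an in-memory pandas DataFrame (the function body and metafeature names here are illustrative, not the project's actual implementation):

```python
import pandas as pd

def get_metafeatures(data):
    """Compute simple metafeatures from either a file path or raw data.

    `data` may be a path to a CSV file or a pandas DataFrame, so callers
    no longer need the dataset to exist on the app filesystem.
    """
    if isinstance(data, pd.DataFrame):
        df = data
    else:
        df = pd.read_csv(data)  # fall back to the old path-based behavior
    return {
        "n_rows": int(df.shape[0]),
        "n_columns": int(df.shape[1]),
        "n_missing": int(df.isnull().sum().sum()),
    }
```

Accepting a DataFrame directly would let the registration API pass uploaded data straight through without a filesystem round trip.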
Replace the node scripts in /awsm/*.js which are responsible for rebuilding and starting docker containers with a docker compose file. Should simplify the install process and resource management.
Docker compose files seem to be supported by AWS and Azure as a means of configuring deployment.
recommender methods should be able to query the database directly for info (e.g. whether an ML+P combo has been run on a dataset) rather than accessing static files. @djfunksalot and I will work on this
@djfunksalot Please add detailed documentation about installing PennAI on a local machine.
write unit tests for recommender, including mock database for randomrecommender.
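A sketch of how such a test could mock the database, assuming a hypothetical `RandomRecommender` that takes a database object exposing a `get_algorithms()` method (both names are illustrative stand-ins for the real interfaces):

```python
import random
import unittest
from unittest.mock import MagicMock

class RandomRecommender:
    """Minimal stand-in: picks a random algorithm the database knows about."""
    def __init__(self, db):
        self.db = db

    def recommend(self):
        algorithms = self.db.get_algorithms()
        return random.choice(algorithms)

class TestRandomRecommender(unittest.TestCase):
    def test_recommend_uses_mocked_db(self):
        # Mock the database so the test never touches Mongo.
        db = MagicMock()
        db.get_algorithms.return_value = ["DecisionTreeClassifier",
                                          "LogisticRegression"]
        rec = RandomRecommender(db)
        choice = rec.recommend()
        self.assertIn(choice, db.get_algorithms.return_value)
        db.get_algorithms.assert_called_once()
```

Injecting the mock through the constructor keeps the unit tests fast and independent of a running MongoDB instance.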
node_modules is too large, ~1 GB. We may need some clean-up.
It is not safe, and this issue is related to #35.
Duplicate files are uploaded when an ml experiment completes. Sample of a completed experiment object:
{ _id: '5b5b61cc5fd793003124a761',
_options:
{ criterion: 'gini',
max_depth: 1,
min_samples_split: 2,
min_samples_leaf: 1 },
_dataset_id: '5b5b61c65fd793003124a700',
_project_id: '5b5b618b596f2e0a82562522',
_machine_id: '5b5b61c05fd793003124a67c',
username: 'testuser',
files:
[ { _id: '5b5b61c65fd793003124a711',
filename: 'adult.csv',
mimetype: 'text/csv',
timestamp: 1532715463639 },
{ _id: '5b5b61d45fd793003124a762',
filename: 'model_5b5b61cc5fd793003124a761.pkl',
mimetype: 'application/octet-stream',
timestamp: 1532715476354 },
{ _id: '5b5b61d45fd793003124a763',
filename: 'model_5b5b61cc5fd793003124a761.pkl',
mimetype: 'application/octet-stream',
timestamp: 1532715476365 },
{ _id: '5b5b61d65fd793003124a766',
filename: 'scripts_5b5b61cc5fd793003124a761.py',
mimetype: 'application/octet-stream',
timestamp: 1532715478162 },
{ _id: '5b5b61d65fd793003124a767',
filename: 'scripts_5b5b61cc5fd793003124a761.py',
mimetype: 'application/octet-stream',
timestamp: 1532715478163 },
{ _id: '5b5b61d65fd793003124a76a',
filename: 'imp_score5b5b61cc5fd793003124a761.png',
mimetype: 'image/png',
timestamp: 1532715478612 },
{ _id: '5b5b61d65fd793003124a76b',
filename: 'imp_score5b5b61cc5fd793003124a761.png',
mimetype: 'image/png',
timestamp: 1532715478612 },
{ _id: '5b5b61da5fd793003124a76f',
filename: 'confusion_matrix_5b5b61cc5fd793003124a761.png',
mimetype: 'image/png',
timestamp: 1532715482656 },
{ _id: '5b5b61da5fd793003124a76e',
filename: 'confusion_matrix_5b5b61cc5fd793003124a761.png',
mimetype: 'image/png',
timestamp: 1532715482657 },
{ _id: '5b5b61da5fd793003124a772',
filename: 'roc_curve5b5b61cc5fd793003124a761.png',
mimetype: 'image/png',
timestamp: 1532715482986 },
{ _id: '5b5b61da5fd793003124a773',
filename: 'roc_curve5b5b61cc5fd793003124a761.png',
mimetype: 'image/png',
timestamp: 1532715482988 } ],
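One possible mitigation sketch, assuming the `files` array has the shape shown above: deduplicate entries by filename before (or after) insertion, keeping the earliest upload. This is an illustrative helper, not the project's actual fix:

```python
def dedupe_files(files):
    """Collapse duplicate file entries, keeping the earliest upload per filename.

    `files` is a list of dicts shaped like the experiment object's `files`
    array above (each with `filename` and `timestamp` keys).
    """
    seen = {}
    for f in sorted(files, key=lambda f: f["timestamp"]):
        seen.setdefault(f["filename"], f)  # first (earliest) entry wins
    return list(seen.values())
```

The real fix is presumably to stop the double upload at the source, but a dedupe pass like this would also clean up existing records.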
Currently there are some python unit tests written for the ai recommender and machine.
Create some basic integration tests and some sort of test runner that can run the integration tests and existing unit tests from the command line.
This behavior does not always occur, and it is unclear how to reliably reproduce it.
Example lab log output for a failure that occurred after automatically starting the AI via common.env and then toggling the AI button:
lab_1 | 2018 01:10:33 AM UTC : checking requests...
lab_1 | 2018 01:10:35 AM UTC : checking results...
lab_1 | requesting from : %s http://lab:5080/api/experiments
lab_1 | 2018 01:10:35 AM UTC : checking requests...
lab_1 | 2018 01:10:35 AM UTC : new ai request for: ['adult']
lab_1 | foo
lab_1 | Traceback (most recent call last):
lab_1 | File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
lab_1 | "__main__", mod_spec)
lab_1 | File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
lab_1 | exec(code, run_globals)
lab_1 | File "/appsrc/ai/ai.py", line 460, in <module>
lab_1 | main()
lab_1 | File "/appsrc/ai/ai.py", line 446, in main
lab_1 | if pennai.check_requests():
lab_1 | File "/appsrc/ai/ai.py", line 193, in check_requests
lab_1 | q_utils.startQ(self,r['_id'])
lab_1 | File "/appsrc/ai/q_utils.py", line 33, in startQ
lab_1 | thread = datasetThread(d,self)
lab_1 | File "/appsrc/ai/q_utils.py", line 16, in __init__
lab_1 | self.name = p.user_datasets[threadID]
lab_1 | KeyError: '5b6259eb509e960031bf9fae'
lab_1 | [TAILING] Tailing last 15 lines for [all] processes (change the value with --lines option)
lab_1 | /root/.pm2/pm2.log last 15 lines:
lab_1 | PM2 | [2018-08-02T01:09:24.112Z] PM2 log: PM2 version : 3.0.3
lab_1 | PM2 | [2018-08-02T01:09:24.112Z] PM2 log: Node.js version : 6.12.3
I followed the instructions to build the project via docker-compose. I ran into three issues:
When running docker-compose build, it complains that the version in the yaml file (3) is too high; I had to switch it to 2 to build.
ERROR: compose.cli.errors.log_timeout_error: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
In each experiment, machine/learn/io_utils.py requests the dataset file via the API, saves a copy into a temporary folder, and then creates a symbolic link pointing to the copy. This wastes storage in tmp space when running many experiments. I will remove the symbolic links and the dataset copy in the temporary folder and make io_utils.py return a pandas DataFrame directly in each experiment.
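A sketch of the in-memory approach, with the API call abstracted behind an injected `fetch` callable (the function name and signature here are hypothetical, not the actual io_utils.py interface):

```python
import io
import pandas as pd

def get_input_data(dataset_id, fetch=None):
    """Return the experiment's dataset as a pandas DataFrame.

    Instead of saving a copy under /tmp and symlinking to it, parse the
    API response in memory. `fetch` is any callable that returns the raw
    CSV text for a dataset id (in the real app this would hit the lab API).
    """
    raw_csv = fetch(dataset_id)
    return pd.read_csv(io.StringIO(raw_csv))
```

Parsing the response with `io.StringIO` avoids touching the filesystem entirely, so nothing accumulates in tmp space across experiments.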
For each dataset experiment, export the model to Mongo so that it can be returned as Python code.
The size of the metafeatures needs to be the same as results_data.
Add more unit tests in Machine and move test code to the machine folder.
I suggest MIT but I'm open to other open source licenses. We should do this ASAP.
We need detailed documentation about how to install PennAI on AWS.
Related to #25
Need to set class_weight = 'balanced' for those ML methods in machine.
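For illustration, this is what the setting looks like on two scikit-learn estimators (a generic sketch; the actual list of affected methods lives in machine's project definitions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# class_weight='balanced' reweights samples inversely proportional to
# class frequencies, which matters on imbalanced datasets like Adult.
clf_lr = LogisticRegression(class_weight="balanced", max_iter=1000)
clf_dt = DecisionTreeClassifier(class_weight="balanced")

# Tiny toy example with heavily imbalanced labels.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2], [2, 2]]
y = [0, 0, 0, 0, 0, 1]
clf_dt.fit(X, y)
```

Without the balanced weighting, majority-class accuracy can mask poor minority-class performance on datasets like these.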
In machine/learn/io_utils.py, data about the available algorithms (projects) is being pulled from /lab/examples/Algorithms/* (look for the phrase self.schema = basedir + '/lab/examples/Algorithms/' ...).
Machine should not be referencing data files in /lab, and this information is redundant to what is loaded in the projects folder of the database (initially loaded from /dockers/dbmongo/files/projects.json), which is being used in other parts of the application (for example, to see which algorithms are valid when a machine is registering its algorithms with the lab api). Machine should be getting this data from the lab api.
Might also consider having the projects.json file live in an upper level config folder instead of buried in the docker structure.
Work has been done to consolidate the project into three docker containers in the pennai_lite branch:
This seems like a reasonable server configuration for now. Continue to clean up the deployment process: clean up and simplify Dockerfiles, add error checking to entrypoint.sh/startup.sh where appropriate, clean up dependencies, remove deprecated files, consolidate the entrypoint.sh/start.sh files used by lab, clean up 'dockers/base' to only include what really needs to be shared, etc.
create a base recommender skeleton so that others can write custom recommender algorithms to compare to the baseline algorithm. Transfer the baseline algorithm to an AverageRecommender class.
I'm interested in assembling a rule-based recommender for PennAI. There are a number of themes that I think will ultimately be important to a strong functioning recommender.
*Meta-features of the dataset (e.g. sample size, number of features, etc.) will be important for making early recommendations on a new dataset. At this point we know nothing else about a dataset other than these metafeatures. We don't know if the dataset signal is clean or noisy, simple or complex, univariate or multivariate, etc. Thus early recommendations should weigh most heavily on available meta-feature info on the data.
*The recommender should always start with the (or one of the) simplest, fastest algorithms and parameter settings, essentially assuming that the data pattern may be a simple one. If it turns out not to be simple, then we have some place logical to build from.
*After the first recommendation I think it will be important for the recommender to focus on changes in metric performance as it transitions from one ML to another or one parameter setting to another. The system should both learn from these performance differences, and apply them when deciding what to recommend next, based on observed performance differences it's seen so far in modeling the current dataset of interest.
*I think that one potentially good way to approach this problem is to update a number of evidence categories on a given new dataset for analysis. These categories would be general themes that are understood to be important factors in the success of one ML algorithm or one parameter setting over another. These might be categories like (small or large feature space, noisy or clean problem, simple or complex associations, no missing or missing data, classification or regression, etc). Over the course of making recommendations the algorithm will update evidence regarding where it knows or thinks the dataset lies in each of these categories, and these probabilities will feed into determining the type of machine learner (and parameters) that gets picked next.
*Ultimately there are two kinds of recommendations we want the system to make. The first is a starting-point recommendation (what single run or set of initial runs do we want to test on this new dataset?). The second is, after the first or first few analyses, what is the next best ML or parameters to test? I think these are almost two separate prediction problems that will need to be handled differently by the recommender.
*I think that there should be a mechanism built into the recommender that notices when performance improvements have stagnated despite having made educated recommendations, at which point the recommender switches to a random exploratory recommendation mode, picking an ML or parameter settings that it hasn't tried and for which there may be no, or low, evidence to support its selection. Whether this random approach continues to be used will be based on whether any new improvements to performance are observed following this random choice.
*Regarding a rule-based method, I might look into chained rules, where one rule can activate another. I think this approach might be useful in this context.
This recommender system will compute a score based on the weighted accuracy, run time, and interpretability of each method.
The recommender will identify which method has a score that is significantly better than that of another method. For example, methodA and methodB receive weighted scores that are not significantly different from each other. methodA and methodB are both significantly better than methodC. The ranking for this instance would look like this:
The weights will be parameterized and will add to 1. For example, scores will be updated like so:
new_scores = weightA * accuracy + weightB * runtime + weightC * interpret
and
weightA + weightB + weightC = 1
The accuracy variable will be the traditional accuracy of each method. The runtime variable is also straightforward: the time each method takes to run. The interpret variable represents the interpretability of each method. For example, there will be an objective ranking of the methods by their complexity for the user to understand. There could be a ranking system that tiers methods and gives them a score of 1, 2, or 3. Methods like logistic regression and decision trees are easier to understand and could be assigned a score of 3, whereas methods that are more complex, such as random forest, may be assigned a 1.
As the framework outlines, this would all be computed in the update method. A separate method, _assign_score, would be used to find the value of the interpret variable for each run.
This recommender system will identify which scores are significantly better than other scores. This would be calculated in the recommend method which ranks the methods based on scores. Two or more methods with scores that are not significantly different would be ranked equally.
Finally, a grid search will determine the optimal weights (values of weightA, weightB, and weightC) for the recommender.
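The scoring formula above can be sketched directly (the parameter names and the division of the 1-3 interpretability tier by 3 are illustrative choices, not part of the proposal):

```python
def weighted_score(accuracy, runtime_score, interpret_tier,
                   weight_a=0.6, weight_b=0.2, weight_c=0.2):
    """Combine accuracy, runtime, and interpretability into one score.

    `interpret_tier` is the proposed 1-3 interpretability ranking,
    normalized here to [0, 1]. `runtime_score` is assumed to already be
    scaled so that higher is better. The weights must sum to 1, as the
    proposal requires.
    """
    assert abs(weight_a + weight_b + weight_c - 1.0) < 1e-9, \
        "weights must sum to 1"
    return (weight_a * accuracy
            + weight_b * runtime_score
            + weight_c * (interpret_tier / 3.0))
```

The grid search over weights would then simply sweep (weight_a, weight_b, weight_c) triples on the simplex and compare recommender performance under each setting.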
Re-structure code as a module so it can be distributed as a package.
Organize files by docker container/purpose, perhaps:
Also come up with a cool naming convention/better names for the docker containers/servers than "lab", "machine", "dbmongo".
This should make it easier for the different parts of the project (ai recommender, supported ml algorithms, webserver) to be developed independently, simplify the docker deployment process and make it easier to change/support different platforms (docker, aws, azure, ???) in the future if necessary.
References #28
When machine is pushing files to lab after running, one or more of the .json files that are pushed cannot be parsed correctly. This can be reproduced by running the integration tests (DecisionTreeClassifier on the Adult dataset).
In machine.js, see:
if (path.match(/\.json$/)) {
// Process JSON files
filesP.push(fs.readFile(path, "utf-8").then(sendJSONResults));
Example log output:
0|machine | pushing /appsrc/machine/learn/DecisionTreeClassifier/tmp/5b5b763b53674b00329201cb/roc_curve.json
0|machine | pushing /appsrc/machine/learn/DecisionTreeClassifier/tmp/5b5b763b53674b00329201cb/value.json
0|machine | pushing /appsrc/machine/learn/DecisionTreeClassifier/tmp/5b5b763b53674b00329201cb/prediction_values.json
0|machine | You have triggered an unhandledRejection, you may have forgotten to catch a Promise rejection:
0|machine | StatusCodeError: 500 - "Error: MongoError: Modifiers operate on fields but we found a Array instead. For example: {$mod: {<field>: ...}} not {$set: [ 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, ...
...
...
...
...
0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0
0|machine | , 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1 ]}"
0|machine | at new StatusCodeError (/appsrc/machine/node_modules/request-promise/lib/errors.js:32:15)
0|machine | at Request.RP$callback [as _callback] (/appsrc/machine/node_modules/request-promise/lib/rp.js:77:29)
0|machine | at Request.self.callback (/appsrc/machine/node_modules/request/request.js:185:22)
0|machine | at emitTwo (events.js:106:13)
0|machine | at Request.emit (events.js:191:7)
0|machine | at Request.<anonymous> (/appsrc/machine/node_modules/request/request.js:1157:10)
0|machine | at emitOne (events.js:96:13)
0|machine | at Request.emit (events.js:188:7)
0|machine | at Gunzip.<anonymous> (/appsrc/machine/node_modules/request/request.js:1079:12)
0|machine | at Gunzip.g (events.js:292:16)
0|machine | at emitNone (events.js:91:20)
0|machine | at Gunzip.emit (events.js:185:7)
0|machine | at endReadableNT (_stream_readable.js:974:12)
0|machine | at _combinedTickCallback (internal/process/next_tick.js:80:11)
0|machine | at process._tickDomainCallback (internal/process/next_tick.js:128:9)
Note that StatusCodeError: 500 - "Error: MongoError: Modifiers operate on...
is the error message returned from lab.
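The MongoError suggests the parsed JSON payload (a bare array, as in prediction_values.json) is being handed directly to $set, which only accepts a document of field/value pairs. A sketch of the shape the update document needs, shown in Python for illustration (the `build_update` helper and field name are hypothetical, not the lab codebase's actual fix):

```python
import json

def build_update(field_name, json_text):
    """Build a valid MongoDB $set update from a pushed JSON file's contents.

    $set requires {field: value} pairs; a top-level JSON array (like the
    contents of prediction_values.json) must be wrapped under a named
    field, otherwise Mongo raises "Modifiers operate on fields but we
    found a Array instead", as in the log above.
    """
    value = json.loads(json_text)
    if isinstance(value, list):
        # Wrap the bare array under the file's field name.
        return {"$set": {field_name: value}}
    return {"$set": value}
```

The equivalent fix on the lab side would wrap the parsed array before passing it to the Mongo update call.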
basic idea:
when making a recommendation, do the following:
The basic idea is using a Deep Neural Network to predict the best ML method+parameter setting based on PMLB.
Eventually, I was thinking of using a Wide & Deep neural network for this, but first attempts will make use of other types of deep neural nets.
filter the recommendations so that results already in the database are not re-run.
add sklearn-benchmark5-data.tsv.gz to the database in json format.
Add the ai as a service in Lab so it starts automatically in the lab container.
When starting pennai completely from scratch, the only datasets initially available via the web interface are the initial datasets defined in pennai/lab/examples/Users/Users.json (Adults, Breast Cancer, Gametes, Hypothyroid, Mushrooms, Readmissions). It seems that after some time, if the containers are restarted, all of the datasets become available.
The cause may be that the process that does the initial loading of the datasets takes a while to run, and does so silently.
Part of the docker setup process for all three containers clones pennai:master into /opt/. The docker build process seems to rely on /opt/ (for example the lab Dockerfile runs 'RUN npm install...' from /opt/). The running containers seem to optionally rely on the contents of /opt/ or the local version of pennai as exposed via a docker shared drive.
This seems confusing and error prone, especially if there is a mismatch between local files and github master. Determine if there was a reason for cloning the project (perhaps as part of AWS deployment?). If it makes sense, update the deployment scripts to copy from the local to /opt/ instead of cloning from github. Attempt to copy only what is necessary for the given container (see #34)
Should we implement a recommender that recommends random algo/param settings based on the ones it's been updated with? I sense that this type of recommender could make a good baseline aside from the simple AverageRecommender.
Currently ML algorithm ("projects" by FG nomenclature) ids are being generated randomly whenever a machine container starts up, and the database is not successfully updated with the new ids if it is not empty. This can cause the state of machine to be out of sync with the database, and if there are multiple machines they will have different ids for the same algorithm.
One symptom of this is the inability to run experiments from the UI.
Use static identifiers for machine algorithms, either through a config file or some naming convention.
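One possible naming-convention approach, sketched here as a hypothetical helper (hashing the name is an illustrative scheme, not the project's chosen one): derive the id deterministically from the algorithm's name so every machine container computes the same id across restarts.

```python
import hashlib

def algorithm_id(name):
    """Derive a stable identifier from an algorithm's name.

    Unlike the current randomly generated ids, the same algorithm name
    always yields the same id, so multiple machine containers and a
    pre-populated database stay in sync. Truncated to 24 hex chars to
    match the size of a Mongo ObjectId string.
    """
    return hashlib.sha1(name.encode("utf-8")).hexdigest()[:24]
```

A config file mapping names to fixed ids would work just as well; the key property is that the id no longer changes on container startup.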
The base recommender experiment should simulate the PennAI system starting with a base set of knowledge, recommending ML algos + params, and evaluating how long it takes for the recommender to recommend an ML algo + params that are within some parameterized delta of the best possible score.
@djfunksalot Please make a video demo showing how to install pennai locally and on AWS in a vanilla environment.
The recommender should have a method to store scores in the database for persistence. It probably makes sense to use PyMongo for this.
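A sketch of what the persistence method could look like. It is written against any object with a PyMongo-style `update_one` method (e.g. a `pymongo` Collection); the function name, collection shape, and document layout are all illustrative assumptions:

```python
def save_scores(collection, recommender_id, scores):
    """Persist a recommender's per-algorithm scores to the database.

    `collection` is any object exposing a pymongo-style update_one
    (e.g. db.recommender_scores from a pymongo client). Upserting keeps
    exactly one score document per recommender, so scores survive
    container restarts instead of living only in memory.
    """
    collection.update_one(
        {"_id": recommender_id},
        {"$set": {"scores": scores}},
        upsert=True,
    )
```

On startup the recommender would do the inverse, a `find_one` on its id, to rehydrate its scores.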