The Brain Predictability toolbox app (BPt_app) is designed to offer a GUI experience on top of the base BPt Python library.


bpt_app's Introduction

Welcome to the Brain Predictability toolbox (BPt) Web Interface.

Intro

This project is designed to be an easy-to-use interface for performing neuroimaging-based machine learning (ML) experiments.

This is an early beta release, so please be mindful that there will likely be some rough edges. Please open an issue with any errors that come up!

The main python library (that serves as a backend for this application) can be found at: https://github.com/sahahn/BPt.

Installation


As it currently stands, BPt_app is designed to be built and run in a docker container. Please follow the instructions below:

  1. Make sure you have docker installed on your device. See: https://docs.docker.com/get-docker/

  2. Secondly, we make use of docker-compose to make the installation process more painless. On some systems it will already be installed with docker, but on others you may need to perform additional steps to download it; see: https://docs.docker.com/compose/install/

  3. Next, clone this repository to your local device. On Unix-based systems, the command is as follows:

    git clone https://github.com/sahahn/BPt_app.git
  4. An essential step to using the application is giving it access to your datasets of interest. Importantly, adding datasets can be done either before installation or after.

    1. Datasets are saved within BPt_app in the folder 'BPt_app/data/sources'

    2. Datasets must be compatible with BPt, which requires the user to format the dataset accordingly before adding it to the sources directory. Specifically, a dataset is composed of a folder (where the name of the folder is the name of the dataset), and within that folder one or more csv files with the dataset's data. For example:

    
     BPt_app/data/sources/my_dataset/
     BPt_app/data/sources/my_dataset/data1.csv
     BPt_app/data/sources/my_dataset/data2.csv
     BPt_app/data/sources/my_dataset/data3.csv
     
    3. Each file with data (data1.csv, data2.csv, data3.csv above) must also be formatted in a specific way. Specifically, all data files must be comma-separated and contain only one header row with the name of each feature (or an index name / eventname, described in the next steps). For example (note that the \n character is usually hidden in most text editors):
    
     subject_id,feat1,feat2,feat3\n
     a,1.4,9,1.22\n
     b,1.3,9,0.8\n
     c,2,10,1.9\n
     
    4. Each file must have a column with a stored subject id. Valid names for this subject id column are currently: ['subject_id', 'participant_id', 'src_subject_id', 'subject', 'id', 'sbj', 'sbj_id', 'subjectkey', 'ID', 'UID', 'GUID']. As long as a column is included and saved under one of those names, that column will be used internally as the subject id. In the example above, 'subject_id' is used as the subject id column.

    5. Next, each data file can optionally be stored with a valid 'event name' column. This column should be stored in the same way as the subject id column, and is used in cases where the underlying dataset is, for example, longitudinal, or in any case where a feature contains multiple values for the same subject. Valid column names for this are currently: ['eventname', 'event', 'events', 'session_id', 'session', 'time_point', 'event_name', 'event name']. Within BPt_app, this column lets you filter data by a specific eventname value. Note that eventnames cannot contain the reserved string ' - '. (A short pandas sketch of writing a compliant file is shown after this list.)

    6. A few general notes about adding data to BPt:

      • You may add multiple datasets, just with different folder names
      • Data will be processed by BPt upon launch of the web application. This means that if you add a new dataset once the application has already been launched initially, that dataset will be processed upon the next launch of the application. Re-loading the web page can also trigger the app to look for changes to the backend data.
      • If a feature / column overlaps across different data sources (e.g., data1.csv and data2.csv), then that feature will be merged across all data files and saved in a new file. The merge behavior is as follows: if new values are found (as indexed by subject id and eventname), they are simply added; if overlapping values are found, the newer value for that subject_id / eventname pair will be used.
      • You can change or delete data files or datasets at will; this will just prompt BPt to re-index that dataset, and changes will be made accordingly.
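
    As referenced above, here is a minimal sketch of writing a compliant data file programmatically (assuming pandas is installed; the dataset name 'my_dataset' and file name 'data1.csv' are placeholders):

     # Minimal sketch: write a BPt-compatible data file into the sources folder.
     # 'my_dataset' and 'data1.csv' are placeholder names.
     import pandas as pd

     df = pd.DataFrame({
         'subject_id': ['a', 'b', 'c'],  # must use one of the valid subject id names
         'eventname': ['baseline'] * 3,  # optional event name column
         'feat1': [1.4, 1.3, 2.0],
         'feat2': [9, 9, 10],
         'feat3': [1.22, 0.8, 1.9],
     })

     # Comma-separated, with a single header row and no extra index column
     df.to_csv('BPt_app/data/sources/my_dataset/data1.csv', index=False)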
  5. Now, to install the application, navigate within the main BPt_app folder/repository and run the docker compose command:

    docker-compose up

    This will take care of building the docker image and application. There are a number of different tweaks you can make here as desired; some of these are listed below:

    • You may pass the flag "-d" (so "docker-compose up -d"), which will run the docker container in the background; otherwise the docker instance will be tied to your current terminal (and therefore shut down if you close that terminal). See https://docs.docker.com/compose/reference/up/ for other similar options.
    • Before running docker-compose up, you can optionally modify the docker-compose.yml file. One perhaps useful modification is to change the value of restart: no to restart: always. What this will do is restart BPt_app whenever it goes down, e.g., when you restart your computer. Otherwise, you must start the container manually every time you wish to use BPt_app after a restart.
    • You can use the command 'docker-compose start' from the BPt_app directory to restart the container
    • Likewise, you can use the command 'docker-compose stop' to stop the web app
  6. After the container is running, navigate to http://localhost:8008/BPt_app/WebApp/index.php. This is the web address of the app, and should bring you to the home page!

Once up and running


The most useful commands to know once up and running are those used to start and stop the container (as mentioned above) with docker-compose, as well as those used to update. There are two main ways to update. The first is a faster, temporary update: it will persist across stopping and starting the docker container (e.g., docker-compose start and stop), but will be deleted if docker-compose down is ever called. To run this faster temporary update, navigate to the BPt_app folder and run the following command (note: the container must be running, e.g., via docker-compose start):

bash update.sh

If instead you would like to do a full and lasting update, this involves re-building the whole container. It will also call git pull on your main directory, looking for changes in the docker files. To run this full update, run within BPt_app:

bash full_update.sh

bpt_app's People

Contributors

sahahn

Watchers

James Cloos, Pierre

Forkers

harel-coffee

bpt_app's Issues

Add control for merge behavior

Current merge behavior is fixed as computing the inner overlap of subjects across the different loaded files. Alternatively, the user could be given control to set any non-overlapping subjects' data to NaN and still keep those subjects, i.e., an outer merge.
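
As a rough illustration of the difference (sketched with pandas rather than BPt's internal merge code; the column and variable names are hypothetical):

    # Inner vs. outer merge on the subject id column, illustrated with pandas.
    import pandas as pd

    d1 = pd.DataFrame({'subject_id': ['a', 'b'], 'feat1': [1.4, 1.3]})
    d2 = pd.DataFrame({'subject_id': ['b', 'c'], 'feat2': [9, 10]})

    # Current behavior: keep only overlapping subjects ('b')
    inner = d1.merge(d2, on='subject_id', how='inner')

    # Proposed option: keep all subjects, with non-overlapping data set to NaN
    outer = d1.merge(d2, on='subject_id', how='outer')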

Develop more in-depth help pages

An interesting idea would be to add in a separate set of pages which could be filled in with more advanced descriptions of certain things. E.g., descriptions on considerations for cross-validation w/ pictures or whatever.

Better ensemble support

Better integration / support / controls + options for integrating Ensembles. E.g., DES split, stacking regressor, etc...

Should have automatic detection of base model type, and show relevant options. Should be able to specify the model responsible for stacking for example.

Quick copy pipeline

It would be nice to have a feature where you could select an existing pipeline and make a copy of it, e.g., similar to how param dists are set up. This could be helpful when only trying to change a few small things.

Automatically detect if parameter search is needed

Change the behavior such that parameter search starts as None, but when any set of params requiring a search (or a Select) is specified, the search automatically changes to RandomSearch.

Could alternatively change it to be a warning, i.e., cause a visual change, and have that pipeline not appear as valid.

Support for submitting jobs to remote clusters

Add in support for integrating remote clusters, i.e., ideally, you could set it up to submit jobs to a cluster. Implementation would likely be similar to VACC_EXT setup, but could involve something different (e.g., maybe running the full docker setup + app on an interactive slurm job??? Can you submit jobs from within interactive slurm jobs?)

Only show imputer if any NaN

Along the lines of automating things that people shouldn't need to think about, add in the automatic hiding of imputers if no NaN data is loaded.

Logs on start screen

Add an explicit log for data loading / proc changes that can be viewed instead of the waiting screen. Also, change the start screen to look a bit better, especially since it will pop up first every time. So maybe have it start with something like: "Checking for changes to the underlying dataset!"

Caching settings

Add the different caching options to the settings page. Make more transparent plus add optional storage limits.

GUI upload datasets

Create a nice GUI interface for uploading custom datasets - i.e., with more controls and flexibility for comma-separated vs tab-separated, options for which column is the subject id, and which is event name.

Relatedly, there could be a GUI screen to see which datasets are available, and maybe delete some?

Better logs for data loading

Have the logs tell you where the latest ML object from a save is stored. Or rather, there should be an option somewhere to download the pickled ML object, so one could, for example, just use the GUI for data loading.

Multiple logins

Investigate how to handle multiple tabs/windows open from the same user. Right now, this behavior will likely just break things in unexpected ways. Not sure of the best way to fix it; it seems like it would require more frequent communication with the server / re-writing a lot of how things are currently stored.

Saving / importing / sharing pipelines

Add in support for saving pipelines, both in versions shared across multiple users and just between projects. E.g., a user should be able to "import" a pipeline from a different project into their current one.

Build in explicitly a single / multiuser mode

A multi-user mode will eventually be put on the DEAP. This means that a number of settings need to be quickly enabled/disabled based on which mode is being run.

This includes:

  • Linking to a different Sets page
  • Removing the choice of dataset
  • Changing how variables are selected
  • Changing the helper text in some cases
  • Changing how data is loaded
  • Removing the load database check + associated pieces
  • Likely more...

Loading set variable display issues

Figure out a fix for what to display when loading a single set variable when the whole set is loaded. E.g., if a filter is set on the whole set, the log will be flooded with before + after values for every variable in the set.

Comparing two (or more) completed jobs

Interface/option to compare the results from two different runs. Maybe some requirement like: they both need to be on the same target, and either both Evaluate or both Test?

Would involve mostly generating tables? Or some plotting? Not sure.

Re-visit param caching

In some cases, the current implementation might not be working 100% correctly. It should also be changed so that param caching is dataset-specific rather than global.

Deleting temp jobs

Temp jobs should be deleted somewhat regularly, to avoid taking up space and to handle cases where silent errors occur. One way of doing this is how validation jobs work: if the output already exists at the start of the job, it is deleted. This won't work for jobs with names, though.

Caching w/ data loading

Changing the event shortname still seems to break data loading caching; i.e., it will still load an incorrectly cached previous copy.

HTML entrypoints

Right now, most of the logic is handled in JavaScript. It could be helpful in the future to add meaningful entry points, e.g., /project_name/page.

First time loading bug

On first-time loading, the app will try to get the ML_options before setup_info.py has been run. The order of operations needs to change so that ML_options are not loaded until after setup has been called.

Add support for Feature Importances

This constitutes a fairly large effort and might involve, to some extent, changes on the BPt side of things. The current thinking is that the place to specify which feature importances to calculate should be the Evaluate tab. In whatever ways possible, the options should be "smartly" generated, i.e., not displaying irrelevant params. Also involved is adding a section to each job's results for viewing the feature importances in different ways.

More info on jobs in results

There should potentially be more entries in the main table, maybe Elapsed? But also, when opening a job, the user should be able to see more detailed information on how that job was run, e.g., what pipeline was used, etc.

Show 'X' not preserved on new set search

When searching for a new set, if the user had previously changed the DataTable to display more than 5 entries, this choice will be refreshed back to 5 upon a new search. Should look into propagating the user choice to a new search.

Filter by percent and outlier

Right now the UI makes it seem like you can filter by both outlier + std. Make sure that the UI reflects the actual behavior, or decide on what the best behavior might be. Also make sure it is well documented in the help string.

Visual feedback on pressing save projects

When clicking save projects, it would be nice to have the button change for a second, just to provide some visual feedback that clicking it actually did something.

Add settings dataset name

On the settings page, for adding short event names, include an indicator for which dataset has that event name (how to handle cases where multiple datasets have the same one?)

NaN threshold for loading sets

For loading sets, add in support for a NaN threshold like the one currently implemented in base BPt's Load_Data. This would also involve improved support for printing information about patterns of NaN.
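
A minimal sketch of what such a threshold could do, written with pandas as a stand-in (not BPt's actual Load_Data API; the function name and default value are hypothetical):

    # Sketch of a NaN-threshold filter: drop any column whose fraction of
    # missing values exceeds the threshold. A pandas stand-in, not BPt's API.
    import pandas as pd

    def drop_high_nan(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
        nan_frac = df.isna().mean()  # per-column fraction of NaN values
        return df[nan_frac[nan_frac <= threshold].index]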

Improve look of Sets page

Improve the look and feel of the Sets page; right now variable names can easily be too long, the top dataset selector is a little clunky, etc.

Datatables sometimes disappearing

When switching between project tabs, if both tabs have a datatable loaded, going back to the first tab (i.e., a loaded set) will sometimes cause that table to appear empty until it is re-drawn (by switching pages or searching).
