An analytic workbench for user-guided development of model pipelines

License: Apache License 2.0


distil's Introduction

distil


Related Projects

  • AutoML Server automated machine learning server component that implements the D3M API.
  • Primitives set of primitives created for use by Distil as steps in a D3M pipeline and included in the base D3M image.
  • Primitives Addendum set of primitives created for use by Distil as steps in a D3M pipeline and not included in the base D3M image.

Dependencies

  • Git and Git LFS version control software.
  • Go programming language binaries with the GOPATH environment variable specified and $GOPATH/bin in your PATH.
  • NodeJS JavaScript runtime.
  • Docker platform.
  • Docker Compose (optional) for managing multi-container dev environments.
  • GDAL v2.4.2 or newer for geospatial data access. Available as a package for most Linux distributions, and on macOS through Homebrew.

Development

Clone the repository:

mkdir -p $GOPATH/src/github.com/uncharted-distil
cd $GOPATH/src/github.com/uncharted-distil
git clone git@github.com:unchartedsoftware/distil.git
cd distil

Install dependencies:

make install

Install datasets:

Datasets are stored using git LFS and can be pulled using the datasets.sh script.

./datasets.sh

To add or remove a dataset, modify the $datasets variable:

declare -a datasets=("185_baseball" "LL0_acled" "22_handgeometry")
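The script presumably iterates over this array to fetch each dataset; a minimal sketch of that pattern follows (the actual datasets.sh and its LFS path patterns may differ):

```shell
#!/bin/bash
# Sketch of the pull loop; the real datasets.sh may differ.
declare -a datasets=("185_baseball" "LL0_acled" "22_handgeometry")
for dataset in "${datasets[@]}"; do
  echo "pulling ${dataset}"
  # git lfs pull --include "datasets/${dataset}/**"   # illustrative path pattern
done
```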

Generate code (optional):

To regenerate the PANDAS dataframe parser if the api/compute/result/complex_field.peg file is changed, run:

make peg

Docker images:

The application requires:

  • ElasticSearch
  • PostgreSQL
  • TA2 Pipeline Server Stub

Docker images for each are available from the docker.uncharted.software registry.

Log in to the Docker registry:

sudo docker login docker.uncharted.software

Update docker-compose.yml to point at the registry images:

---
distil-auto-ml:
  image: docker.uncharted.software/distil-auto-ml

Pull Images:

Pull docker images via Docker Compose:

./update_services.sh

Running the app:

Using three separate terminals:

Terminal 1 - Launch docker containers via Docker Compose:
./run_services.sh
Terminal 2 - Build and watch webapp:
yarn watch

The app will be accessible at localhost:8080.

Terminal 3 - Build, watch, and run server:
make watch

Advanced Configuration

The location of the dataset directory can be changed by setting the D3MINPUTDIR environment variable, and the location of the temporary data written out during model building can be set using the D3MOUTPUTDIR environment variable. The host IP address of the docker containers, if not localhost, can be set with DOCKER_HOST (e.g. export DOCKER_HOST=192.168.0.10 && make watch). These are used by the other Distil services that are launched via the run_services.sh script, and are typically set as global environment variables in .bashrc or similar.
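For example, a .bashrc stanza might look like this (the paths and IP address are illustrative, not required values):

```shell
# Illustrative values; substitute your own paths and host IP.
export D3MINPUTDIR="$HOME/datasets"          # where input datasets live
export D3MOUTPUTDIR="$HOME/distil-output"    # temporary model-building output
export DOCKER_HOST=192.168.0.10              # only if containers are not on localhost
```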

Linter Setup

VSCODE

For the VS Code editor, download and install the ESLint extension. Once installed, open the editor settings (hot key ⌘⇧P -- type settings) and add the following to your settings file:

  "eslint.lintTask.enable": true, // enable eslint to run
  "eslint.validate": [
    "vue", // tell eslint to read vue files
    "html", // tell eslint to read html files
    "javascript", // tell eslint to read javascript files
    "typescript" // tell eslint to read typescript files
  ],
  "eslint.workingDirectories": [{ "mode": "auto" }], // eslint will try its best to figure out the working directory of the project

At this point, save your settings file and restart VS Code. If the linter is not working after restarting, check the output (^⇧` -- OUTPUT tab -- dropdown -- ESLint).

Common Issues:

"../repo/subpackage/file.go:10:2: cannot find package "github.com/company/package/subpackage" in any of":

  • Cause: Dependencies are out of date or have not been installed
  • Solution: Run make install to install latest dependencies.

"# pkg-config --cflags -- gdal gdal gdal gdal gdal gdal Package gdal was not found in the pkg-config search path."

  • Cause: GDAL has not been installed
  • Solution: Install GDAL using a package for your environment or download and build from source.

Mac

runtime error while training "joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker."

  • Cause: Not enough Docker resources
  • Solution: change Docker resources to recommended "CPU:10, RAM:10 gigs, Swap:2.5 gigs, Disk Image Size: 64 gigs"


distil's Issues

Support ingest of r_ datasets

Add support for ingesting the tabular r_ datasets since they can now be semantically typed by Popily. Image / audio datasets need not be supported.

Create D3M ElasticSearch dev container

Internal mocking of ElasticSearch is already becoming fragile, and test cases that rely on it fail to test the entire code path, validating the response handling of a given query, but not the construction of the query itself. It would be more useful to execute endpoint tests against a local ElasticSearch instance with a test data subset. This would also allow us to create a full dev environment that doesn't rely on our internal shared ES. We should be able to do this by creating a Docker container which wraps up an ES instance and some pre-populated test data. The existing veld-es-redis docker container project can be modified to work with D3M ingest for this purpose.

generate feature list for pipeline creation

The initial release version of the ta3ta2 API uses a list of (Feature, URI) tuples to identify the list of training features to use for pipeline creation. We need to update the handlePipelineCreate message in ws/pipeline.go to include the features in the outgoing message; it currently includes only the URIs.
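A rough sketch of the shape of the change, with hypothetical type and field names (not the actual generated ta3ta2 message definitions):

```go
package main

import "fmt"

// Feature pairs a feature name with the URI of the dataset it lives in.
// These type and field names are illustrative, not the real API types.
type Feature struct {
	Name string
	URI  string
}

// PipelineCreateMsg previously carried only URIs; it should carry the
// (Feature, URI) tuples the API expects.
type PipelineCreateMsg struct {
	TrainFeatures []Feature
}

func main() {
	msg := PipelineCreateMsg{TrainFeatures: []Feature{
		{Name: "Games_played", URI: "file:///data/185_baseball"},
	}}
	fmt.Println(len(msg.TrainFeatures))
}
```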

PostGres storage needs limits similar to ES

Postgres queries should limit the number of returned results to improve performance. Currently the categorical histogram queries in ES limit results to the top 10 terms - in PG we seem to impose no limit, which leads to big performance problems. The overall number of results on filter queries should be limited as well if it isn't already. We can paginate as a future feature if needed.
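Capping a categorical histogram query in PG to mirror the ES top-10 behavior could look like this sketch (the table and column identifiers are placeholders, and real code should quote identifiers):

```go
package main

import "fmt"

// buildTermCountQuery returns a histogram query capped at `limit` terms,
// mirroring ES's top-N terms aggregation. Identifiers are illustrative.
func buildTermCountQuery(table, column string, limit int) string {
	return fmt.Sprintf(
		"SELECT %s, COUNT(*) AS count FROM %s GROUP BY %s ORDER BY count DESC LIMIT %d;",
		column, table, column, limit)
}

func main() {
	fmt.Println(buildTermCountQuery("d_185_baseball", "Position", 10))
}
```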

Feature search/selection and ordering

The feature summaries panels need to be enhanced to support:

  • A search box that will display features that match the input search terms, hiding those that don't match
  • On and Off buttons that will allow all matched features to have their collapse state turned on/off.
  • Some type of UI element to sort
    • alpha-numerically
    • based on importance score (Popily value added on ingest)
    • based on novelty score (Popily value added on ingest)
  • On the select screen only:
    • A button to add all matched to the training set in the "available set" panel
    • A button to remove all matches from the "training set" panel

golang watch and restart

Before we really start diving into backend dev, we should find a reliable way to have the server watch and restart on any file changes. Preferably something that will:

  • Watch a given array / glob of files / dirs
  • Ignore a given array / glob of files / dirs
  • on change, build (possibly lint / vet / fmt) and
    • on success, restart the executable
    • on failure, continue watching, but do not run / fall back onto prev executable

Integrate Popily feature selection primitive output

Popily has added new primitives to aid users in the feature selection process. We need to integrate those into our ingest process so that the rankings are available to help us sort features for exploration / selection. We should probably store them in a Postgres table rather than ES - they don't need to be searchable.

blocks parts of #78

Add table to display filtered data

We need to add support for a table that renders data based on the facet state. The enable state of each facet should indicate whether or not that particular column is included in the fetched data, and the range/category state for each should act as a filter on the data. A new search endpoint will be required on the back end, and a table to display the data will be required on the front end.

Send UUID instead of URI for pipeline results

The handleCreatePipelineRequest function in ws/pipeline.go currently sends a result URI down to the client. This is a temporary measure - the URI should be mapped to some type of a unique ID on the server side before being sent to the client, and should map back from that unique ID to URI when the server needs to fetch that data on behalf of the client.
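One way to sketch the URI-to-ID mapping, using a random hex ID as a stand-in (the real implementation may use a proper UUID library and persistent storage):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// resultIDs maps opaque IDs back to server-side result URIs so that raw
// URIs never reach the client. The scheme here is illustrative.
var resultIDs = map[string]string{}

// registerResult stores the URI and returns the opaque ID the client sees.
func registerResult(uri string) string {
	buf := make([]byte, 16)
	rand.Read(buf)
	id := hex.EncodeToString(buf)
	resultIDs[id] = uri
	return id
}

func main() {
	id := registerResult("file:///tmp/results/pipeline_1.csv")
	fmt.Println(len(id), resultIDs[id])
}
```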

Facet loading is very slow

On my laptop it takes 30+ seconds to load the o_313 dataset, all of which is spent in the facet initialization calls. We either need to optimize this so that it runs in no more than a few seconds, or make the facets load in via a paging mechanism.

implement home screen

The home screen should contain:

  1. A search box that will jump to the search screen when clicked
  2. A list of recently used datasets. Clicking should jump to select screen with associated filters applied.
  3. A list of recently generated pipelines. Clicking should jump to the results screen for those pipelines.
  4. A list of currently generating pipelines. Clicking should jump to the build screen.

This information should ideally be persisted to PostGres, but we can write it into browser storage for the time being.

Add more info to search screen

In their rolled up state, the entries in the search screen should still display summary information, rather than just a title. At minimum:

  • a text preview
  • a list of important features (based on Popily's scoring)
  • some basic summary information (number of records, number of features)

HTTP logging middleware

We should replace any current logging statements with higher resolution trace / debug level logs and then use a logging middleware for the http requests.

results should display in table for selected pipeline

In the Results view, selecting a pipeline from the right hand side should add the predicted values to the table view, next to target values. The predicted values can be fetched from the server using the /distil/results route that currently exists.

Depends on #71.

Add score display to results

Results summaries for regression tasks should display the R^2 score only, referred to as "Fit" rather than R^2. Classifications should display Accuracy only, no rename necessary when displayed.

Update to latest ta3ta2 API version

  1. Update pipeline_service.proto to the latest version
  2. Re-generate go source
  3. Integrate changes into distil-pipeline-server project
  4. Integrate changes into distil project

One behavioral change that is present is the addition of the UPDATED state in the create results. UPDATED provides intermediate results that can be overridden by a subsequent UPDATE or COMPLETED result set.

running results don't display in Results view when done

The Build view allows for the user to click on the pipelines for a given create request and display them as pending in the Results view. They currently remain in the pending state regardless of whether or not the results are available.

Add pipeline creation interface

Add an interface to launch a pipeline create request given the feature/filter state applied by the facets. The interface should allow the user to input the target feature, task, and metric for scoring, and initiate a pipeline create request based on that configuration. The application should track the status of the request and accumulate results as they complete, and support user browsing of the results.

Regression result enhancements

Results for regression tasks should include a distribution of the ground truth values as a facet histogram above the actual predicted result histograms. This facilitates understanding of the error distribution. An error range slider should be added to allow the user to select the range for which a prediction is considered acceptable - we can hard code this to root mean squared error for now. The results should be displayed in two vertically stacked tables - one containing records inside the acceptable error range, the other containing those outside it.

Display dataset search matches

The description text for datasets matched via search should be highlighted. A list of any matched variable names should be included as a header as well, with the substring that triggered the match highlighted.

Postgres implementation needs to handle included fields

The ES implementation used the concept of excluded fields to filter which fields to return for analysis. SQL does not support field exclusion but instead requires the inverse list of fields to include. Currently all fields are returned regardless of user selection. The Postgres implementation needs to be updated to use the included fields list to filter the returned fields.
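Building the projection from an included-fields list might look like this sketch (table and column identifiers are placeholders, and real code should quote identifiers):

```go
package main

import (
	"fmt"
	"strings"
)

// buildSelect projects only the fields the user included, since SQL has
// no "exclude" list. Identifier quoting is elided for brevity.
func buildSelect(table string, included []string) string {
	return fmt.Sprintf("SELECT %s FROM %s;", strings.Join(included, ", "), table)
}

func main() {
	fmt.Println(buildSelect("d_185_baseball", []string{"Player", "Hits", "Position"}))
}
```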

need radio-group style selection of facet groups

On the right side of the Results view, we display a facet group for each pipeline created as part of a request. We should be able to set one of these groups into a selected state and disable the others, similar to what happens with a radio button group. A minimal capability would be to just use the existing toggle facility and have the Facets widget manage the toggle state appropriately.

Additional build options

Options in the build screen should include a text input box so that the user can enter a description, and a numeric entry to set max number of pipelines to process.

Deconflict UI in Select and Build screens

The Build screen currently has UI for selecting a target feature, scoring metric, and initiating the pipeline create request. This conflicts with target feature selection in the Select screen. At minimum, we should ensure the target from the select screen is pre-populated when the Build screen is displayed, along with a default scoring metric.

Pipeline names need to be improved

The displayed pipeline name is currently generated by concatenating the dataset name, the target variable, and the first part of the pipeline ID. This should be replaced with something that is more user-friendly.

support for categorical summaries

VariableSummariesHandler in routes/routes.go currently only generates summaries for float/int variables. It should also generate them for categorical variables, with the caveat that categorical variables can consist of text-based labels, or numeric values that denote a category. Histograms should be based on a term count, and should be displayed in the UI using the vertical display, rather than the horizontal histogram.
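The term-count portion could be as simple as this sketch, treating every value (textual label or numeric category code) as an opaque category:

```go
package main

import "fmt"

// termCounts tallies occurrences of each category value, whether the
// values are text labels or numbers that denote a category.
func termCounts(values []string) map[string]int {
	counts := map[string]int{}
	for _, v := range values {
		counts[v]++
	}
	return counts
}

func main() {
	fmt.Println(termCounts([]string{"Pitcher", "Catcher", "Pitcher"}))
}
```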

link table and facets on mouse move

When the user mouses over a table row, the appropriate buckets / categories in each facet should be highlighted. When a user mouses over a bucket/category in a facet, the corresponding table rows should be highlighted. Clicking on a table row should make the facet highlights stick, and clicking on the facet should make the table highlights stick.

Split classification output into 2 tables

The results screen should display results as two vertically stacked tables - one showing records in which the predicted value was correct, the other showing records in which the predicted value is incorrect.

postgres eventually stops returning results

I can reproduce this fairly reliably on master with PG enabled by getting run output in the Results screen and clicking through the result sets on the right a number of times. Each click should result in a corresponding result fetch going out in the Network tab of the chrome debugger - usually after about 7 or 8 fetches a request will remain in the pending state, and there won't be any errors reported in the app or postgres logs.

Upgrade to Elastic Search v5

ElasticSearch v3 and older only allow for integral bucket sizes for histogram aggregations, which causes histograms for small ranges of floating point data to have limited display precision. ElasticSearch v5 supports float bucket sizes, which will correct the problem. We should update our internal dev ES instance to v5, our source to use Elastic Go with v5, and the histogram code to use floats for the bucket sizes.

optional redis cache middleware

We should add an optional caching middleware to our HTTP requests. If no Redis instance is available, simply bypass the cache; if one is, pull responses from it instead.
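The bypass behavior can be sketched with the cache abstracted behind an interface, rather than tied to any specific Redis client (the interface and function names are illustrative):

```go
package main

import "fmt"

// Cache is a minimal stand-in for a Redis-backed response cache.
type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

// fetch consults the cache when one is configured, and falls through to
// the real computation otherwise. A nil cache means "no Redis": bypass.
func fetch(c Cache, key string, compute func() string) string {
	if c != nil {
		if v, ok := c.Get(key); ok {
			return v
		}
	}
	v := compute()
	if c != nil {
		c.Set(key, v)
	}
	return v
}

func main() {
	// nil cache: always bypass and compute fresh.
	fmt.Println(fetch(nil, "/distil/datasets", func() string { return "fresh" }))
}
```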

Add Export and Abort options

NIST requires two elements be present in the UI to facilitate the eval:

  1. An Export button. When clicked, this needs to call ExportPipeline from the TA3/TA2 interface. This signals the user is done, and exports the selected pipeline. We can probably make this a button in the Results summary, and have it pop up a dialog warning the user that it signals the problem is complete.
  2. An Abort button. When clicked, this terminates the session, signalling that the user cannot complete the problem in question (no support for particular task, etc). There may be a need to call a NIST API to ensure TA2 is shut down.

Add ability to save / load a filtered dataset

We need to support the ability to save / load the edited version of a dataset. Probably the easiest way for us to do this currently is to store the filter state in the browser data store. Based on the outcome of TA3/TA2 interface discussions, we would either need to send this state to TA2 as part of a model run request, or write a new version of the dataset that has the columns / rows removed out to HDFS or some other type of external store.

Markdown rendering for description text

D3M test datasets have markdown descriptions that we are just rendering as raw text. We should be able to display the text with formatting applied either using a server side or a client side markdown rendering library. Text that is not markdown should pass through unchanged.

Use docker compose to launch dev env

We currently have a selection of shell scripts to launch our development containers. Docker compose should allow us to manage config / lifecycle of all them at once.

build info should be embedded in executable and logged

Go allows the build step to inject symbols into the compiled executable, allowing things like the commit associated with a given build, timestamp, etc. to be embedded and logged. This is often really useful for debugging integration issues as it disambiguates the content of a binary.
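A common pattern is to declare package-level strings and override them at link time with -ldflags; the variable names and flags below illustrate the technique, not distil's actual setup:

```go
package main

import "fmt"

// Overridden at build time, e.g.:
//   go build -ldflags "-X main.commit=$(git rev-parse --short HEAD)"
// These names are illustrative.
var (
	commit    = "unknown"
	buildTime = "unknown"
)

func main() {
	fmt.Printf("build commit=%s time=%s\n", commit, buildTime)
}
```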
