nasa-jpl-memex / sce-domain-discovery



Domain Discovery for the Sparkler Crawl Environment

License: Apache License 2.0

Java 52.87% Python 40.81% Dockerfile 0.72% Shell 5.60%

domain-discovery crawling crawling-framework python flask svm-model svm-training sparkler irds usc

sce-domain-discovery's Introduction

polar-domain-discovery

Domain Discovery on Any Domain

sce-domain-discovery's People

Contributors

Stargazers

Watchers

Forkers

davtalab uscdatascience arunsigood digitalcompanion 5l1v3r1 socioprophet

sce-domain-discovery's Issues

Show alert on successful model updates

We should bring back the alert message on successful model updates

search box improvements

The search box should accept advanced search capabilities, e.g. adding a - before a word should exclude it from the search.

This should be common functionality for our search engine, we just need to be able to pass it on - then we can include advanced search instructions as well.
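A minimal sketch of how the `-` prefix could be separated out before the query is passed on to the search engine. The function name `parse_query` and the returned tuple shape are hypothetical, not part of the existing code base.

```python
def parse_query(query):
    """Split a raw query string into included and excluded terms.

    A token that starts with "-" (and has at least one more character)
    is treated as an excluded term; everything else is included.
    """
    include, exclude = [], []
    for token in query.split():
        if token.startswith("-") and len(token) > 1:
            exclude.append(token[1:])  # drop the leading "-"
        else:
            include.append(token)
    return include, exclude
```

If the underlying search engine already understands the `-` operator, the raw query can simply be forwarded unchanged and this parsing is only needed for showing the user how the query was interpreted.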

Show progress bar of URLs being processed when a query is performed

When the user performs a search query, show a progress bar or something similar rather than the infinity GIF. The GIF does not indicate whether Firefox has crashed or the parsing is just taking a long time.

We need a /status endpoint that the UI can query to get the status of the system, e.g. working, exception, or idle.
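One way the state behind such an endpoint might be tracked is sketched below. The class `StatusTracker`, its method names, and the JSON field names are assumptions for illustration; only the three states (working, exception, idle) come from the issue text. A Flask route for /status could return the result of `snapshot()` as JSON.

```python
import threading
from enum import Enum


class Status(Enum):
    IDLE = "idle"
    WORKING = "working"
    EXCEPTION = "exception"


class StatusTracker:
    """Thread-safe holder for the state a /status endpoint could report."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = Status.IDLE
        self._processed = 0
        self._total = 0
        self._message = ""

    def start(self, total):
        # Called when a query begins; total is the number of URLs to process.
        with self._lock:
            self._status = Status.WORKING
            self._processed = 0
            self._total = total
            self._message = ""

    def advance(self):
        # Called after each URL has been fetched and parsed.
        with self._lock:
            self._processed += 1

    def finish(self):
        with self._lock:
            self._status = Status.IDLE

    def fail(self, message):
        with self._lock:
            self._status = Status.EXCEPTION
            self._message = message

    def snapshot(self):
        # Plain dict so it can be serialized directly to JSON.
        with self._lock:
            return {
                "status": self._status.value,
                "processed": self._processed,
                "total": self._total,
                "message": self._message,
            }
```

The UI can poll the endpoint and render `processed / total` as the progress bar, switching to an error display when the status is `exception`.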

Popup info on the main interface

The following items in the interface should have tooltips added to them:

Highly Relevant buttons: "Highly Relevant: This page contains the type of information required to answer questions for this domain. This is exactly what we are looking for."
Relevant buttons: "Relevant: This page contains domain-relevant information, but it won't help answer questions."
Not Relevant: "Not Relevant: This page is irrelevant to the domain."

Add Launch Crawl Button

If there is a seed file and at least 10 of each type of relevancy marked, the crawl button should be usable. It will just launch the crawl.
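The enabling condition could be expressed as a small predicate, sketched here. The label keys and the function name `can_launch_crawl` are hypothetical; the threshold of 10 per relevancy type is taken from the issue.

```python
# Assumed label keys; the real application may name these differently.
LABELS = ("highly_relevant", "relevant", "not_relevant")


def can_launch_crawl(has_seed_file, label_counts, minimum=10):
    """Return True when a seed file exists and every label has enough marks."""
    enough_labels = all(label_counts.get(label, 0) >= minimum for label in LABELS)
    return bool(has_seed_file) and enough_labels
```

The UI would call this after each model update and enable or disable the crawl button accordingly.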

UI fixes

Just to list some of the things that need to be resolved in the future:

account for different browser widths - narrow looks really bad right now
make sure iframe boxes grow to match content (more than two lines at the top of a box currently push the content out the bottom)
separate crawl model tools from the tools to run a crawl
add hovertips on the left side column tools, especially to distinguish stop from halt crawl
make sure the correct pointer shows up on hover in the left side bar - many of them are the text pointer instead of the arrow pointer for some reason
Make it clear what the minimum steps are for launching a useful crawl - have them light up in order or the crawl button isn't available until they are completed, or something else. We'll discuss it.
Improve the number of pages that show in the iframes.
Review default-to-non-relevant on the first search. I'm not sure I'm sold on this.
update model notification should tell how many of each relevancy were marked as well as the new totals - if they have reached 10 of each, it should say "You can now launch a crawl."
the metrics wrap strangely once 2 or more reach double digits
fix colors - red means stop and green means go - almost universally and here they are switched - intuitively very confusing
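For the update model notification item above, the message could be assembled roughly as follows. The function name, label strings, and message layout are assumptions; only the closing sentence "You can now launch a crawl." and the threshold of 10 come from the list.

```python
# Assumed display names for the three relevancy labels.
LABELS = ("Highly Relevant", "Relevant", "Not Relevant")


def update_message(newly_marked, totals, minimum=10):
    """Build the notification text shown after a model update.

    newly_marked and totals both map a label name to a count.
    """
    parts = [
        f"{label}: +{newly_marked.get(label, 0)} (total {totals.get(label, 0)})"
        for label in LABELS
    ]
    message = "Model updated. " + "; ".join(parts) + "."
    if all(totals.get(label, 0) >= minimum for label in LABELS):
        message += " You can now launch a crawl."
    return message
```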

Make useful metrics

Either make Accuracy mean something or replace it with something else useful, eg Number of Highly Relevant, Relevant, and Not Relevant pages that have been marked.

Stop Crawl Button

If we're going to launch a crawl that will run forever, we need a stop crawl button also.

Improved Seed file management

When a seed file is uploaded, its name should be displayed in the interface.
Clicking on this name should bring up a list of the urls in the file.
There may be some additional capabilities that should go along with this. Let's discuss before working on it.

enter launches search

It is default internet usage. Please don't make me click the magnifying glass!!

Using provided seeds to improve domain discovery model

The seeds are the SME's best proposal of exactly what they are looking for. They should be added to the domain discovery model as "highly relevant".

Create save and load functionalities

The users should be able to save and load a model on the server. For instance, the user can save a model on the server by giving a name and load it subsequently. All the saved models must be reported in a list where the users can select the model to be used.
These new functionalities will replace import and export (the model can no longer be exported to the user's local machine).
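A rough sketch of named save, load, and list operations on the server side. The class `ModelStore`, the use of `pickle`, and the `.pkl` file layout are assumptions for illustration; the project may serialize its models differently.

```python
import pickle
from pathlib import Path


class ModelStore:
    """Stores models by name in a single server-side directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name):
        # Reject names that could escape the storage directory.
        if not name or name in (".", "..") or any(c in name for c in "/\\"):
            raise ValueError(f"invalid model name: {name!r}")
        return self.root / f"{name}.pkl"

    def save(self, name, model):
        with open(self._path(name), "wb") as handle:
            pickle.dump(model, handle)

    def load(self, name):
        # Only load files this server wrote itself; pickle is unsafe
        # for untrusted input.
        with open(self._path(name), "rb") as handle:
            return pickle.load(handle)

    def list_models(self):
        # Names shown in the list the user selects from.
        return sorted(path.stem for path in self.root.glob("*.pkl"))
```

`list_models()` supplies the list of saved models, and `load()` is what selecting an entry in that list would trigger.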

Add Dashboard button

Once the crawl is launched, the Dashboard button should be usable as well. This is just a link to /banana.

All of these buttons should be in the left hand side bar.

Show current label/model statistics on the UI

The UI should show the user the current number of pages labelled, the number of pages in each label, and the URLs in those labels.
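These statistics could be derived from a mapping of URL to label, as in the sketch below. The function name and the shape of the returned dictionary are hypothetical.

```python
from collections import defaultdict


def label_statistics(labels):
    """Summarize labelled pages.

    labels maps each URL to the label the user assigned to it.
    """
    by_label = defaultdict(list)
    for url, label in labels.items():
        by_label[label].append(url)
    return {
        "total": len(labels),
        "labels": {
            label: {"count": len(urls), "urls": sorted(urls)}
            for label, urls in by_label.items()
        },
    }
```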

Have the results be marked as non-relevant by default

The user is getting confused by the meaning of non-relevant. Provide a way to mark all results as non-relevant by default and ask the user to click only the ones that are interesting. This may reduce the number of labels we have to two.

Add Upload Seed File functionality

For this first pass, it should accept a .txt file only and do some basic checking to make sure that the file contains useful URLs.
The file should then be saved to the appropriate location on the server, and ideally the name of the file will be shown on the interface.
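The basic checking might look like the following sketch, which assumes one URL per line and treats only http and https URLs with a host as useful. The function name and the choice to return rejected lines are assumptions.

```python
from urllib.parse import urlparse


def validate_seed_file(filename, text):
    """Return (valid_urls, rejected_lines) for an uploaded seed file."""
    if not filename.lower().endswith(".txt"):
        raise ValueError("seed file must be a .txt file")
    valid, rejected = [], []
    for line in text.splitlines():
        url = line.strip()
        if not url:
            continue  # ignore blank lines
        try:
            parsed = urlparse(url)
        except ValueError:
            # urlparse raises on some malformed inputs, e.g. bad IPv6 hosts.
            rejected.append(url)
            continue
        if parsed.scheme in ("http", "https") and parsed.netloc:
            valid.append(url)
        else:
            rejected.append(url)
    return valid, rejected
```

Returning the rejected lines lets the interface tell the user which entries were dropped instead of failing the whole upload.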

Interface enhancements

Just a few things - should just be cosmetic and relatively simple

Change "Build Model" to "Create New Model"
Move the import and export model functionality to the left side bar underneath "Create New Model"

Managing models

Once we have reworked the model saving system in #7, this is how we make managing models more intuitive.

Each model in the list of models should have a drop down list next to it that contains the following capabilities:

Set as Crawl Model: If set as the crawl model, the model is the one that will be used by Sparkler for determining relevance. When this happens, the model name will be all pretty - different color/bg, bold, whatever - but it needs to stand out - and also have a hover tip that says: "Active Crawl Model"
Rename: The user can give the model a new name, since by default it is named after the first query run to create it. These names may be non-intuitive and duplicative.
Export: The user can export the model and save it on their computer.

In addition, the following changes to the interface will need to take place:

The Selected Model is the model that is being trained. Its name is shown above the "Create New Model" link and is in bold and a different color or whatever to make it obvious that it's the current one. A hover tip should be added that says, "You are training this model now."
Remove the Export model button from the left hand column
Importing a model will cause it to become the selected model (not the crawl model).
Selecting a model from the list will make it the selected model.
It should be possible to select the Crawl Model, i.e. a model can be the crawl model and be actively trained at the same time (this is current functionality).

The left hand side column should contain these things, in this order:
Search box
Selected Model Name
"Create a New Model"
Import a Model
List of models saved on server

Let users mark a subset of the results

The users should not be forced to mark all the 12 results, but they should be able to mark a subset of the results. This may happen, for example, when users forget to mark some result or do not want to mark a specific result. In any case, no error has to be reported when one or more results have not been marked.
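A small sketch of how unmarked results could be skipped rather than rejected. The function name and the convention that an unmarked result is either absent or `None` are assumptions.

```python
def collect_marked(result_urls, marks):
    """Keep only the results the user actually marked.

    marks maps a URL to its label; unmarked results are either missing
    from the mapping or mapped to None, and are silently skipped.
    """
    return {
        url: marks[url]
        for url in result_urls
        if marks.get(url) is not None
    }
```

Only the returned subset would be fed to the model update, so a page of 12 results with three marks contributes three training examples and no error.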

Dump data button

This will complete all of the functionality that we created for the DD Eval.