phurwicz / hover


:speedboat: Label data at scale. Fun and precision included.

Home Page: https://phurwicz.github.io/hover

License: MIT License

Python 85.44% Jupyter Notebook 14.56%
visualization machine-learning bokeh data-labeling supervised-learning bulk-labeling text-classification image-classification audio-classification annotation-tool

hover's People

Contributors

codacy-badger, haochuanwei, phurwicz, robinsonkwame


hover's Issues

Feedback on Quickstart

For the step "Ingredient 2 / 3: Embedding", I had to follow the spaCy instructions to download the model... Maybe this is a case where "if you can't figure out how to get a spaCy model, you shouldn't be using hover" ;-)

python -m spacy download en_core_web_md

Also, when running in NOTEBOOK MODE with Jupyter,

This command didn't work:

show(interactive_plot, notebook_url='https://localhost:8888')

But I tried just

show(interactive_plot)

and had success, at least in showing the widgets right in the notebook! I noticed that they don't quite fit, but that might be another issue.
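For reference, a minimal sketch of notebook-mode usage: Bokeh's notebook_url defaults to "localhost:8888" and should match the address in the browser bar, so the "https://" scheme in the failing call above is a likely culprit.

    from bokeh.io import output_notebook, show

    output_notebook()
    show(interactive_plot)  # usually sufficient for a local Jupyter server
    # explicit form -- note the bare host:port, no "https://":
    # show(interactive_plot, notebook_url="localhost:8888")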

Should vectorizer be limited to returning 1-D arrays only?

This is to document some thoughts on

whether to allow vectorizers to output 2-D arrays, or even n-D arrays

which these classes/methods currently make use of:

  • hover.core.dataset.SupervisableDataset.compute_2d_embedding()
  • hover.core.neural.VectorNet
  • hover.core.neural.MultiVectorNet

Arguments for n-D array support

  • for image / audio processing, 2-D arrays are closer to the original input.
  • for sequence models in NLP, a doc with token-level embeddings is a 2-D array. Aggregating into 1-D can lose information.

Arguments against n-D array support

  • in hover.core.dataset.SupervisableDataset.compute_2d_embedding(), it is unclear to the dimensionality reducers how to take advantage of 2-D localities.
    • Of course, if there is no locality, one can simply flatten 2-D into 1-D.
    • Consider an if-not-1D preprocessing function that defaults to something like array.flatten() (sketched below).
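A minimal sketch of that preprocessing idea, with hypothetical names:

    import numpy as np

    def ensure_1d(vectorize, flatten=np.ravel):
        """Wrap a vectorizer so that n-D outputs are flattened to 1-D."""
        def wrapped(feature):
            vector = np.asarray(vectorize(feature))
            # no-op for 1-D output; otherwise apply the flattening function,
            # which a locality-aware caller could override
            return vector if vector.ndim == 1 else flatten(vector)
        return wrapped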

multi labeling

I probably overlooked it somewhere, but does hover support multi-labeling? As a kludge I can create new labels with commas between them, but I thought I'd ask here.

bulk labeling selection should automatically update "view selection"

Hi, I'm running the quickstart0 code.

Current behavior: after every selection (using the lasso tool or poly tool) I have to press the "View selected" button to update the table.
Desired (new) behavior: the button would not need to exist; the table would update automatically whenever the selection changes (be it via the lasso or poly tool).

I'd be happy to dig into the code if you point me toward where to look; I'm not sure how easy it is to write a hook for the Bokeh plot.
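For what it's worth, a hedged sketch of what such a hook might look like: Bokeh data sources expose a selection-change event, so the table refresh could be registered as a callback. Here `source` is assumed to be the explorer's ColumnDataSource, and refresh_selection_table is a hypothetical stand-in for whatever currently backs the button.

    def on_selection_change(attr, old, new):
        # `new` holds the indices selected by the lasso/poly tool
        refresh_selection_table(new)

    source.selected.on_change("indices", on_selection_change)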

QuickStart via Docker Container

It feels like this could be made very easy to demo/test if it were possible to run some of the core components (the UI) as a Docker image, with the data loaded via an API.

Setting custom colors

The default gray for ABSTAIN labels is difficult to see on my screen. I wondered if there was an easy way to set custom color per label.

Add tooltip support for image/audio data

Requested by /u/svldsmnn on reddit, e.g. audio playback on mouse hover or click (TapTool can be a good idea when the crosshair hits multiple points and can't decide which one to pick up).

For running audio playback, this gist can be helpful despite coming from the days of bokeh 0.12.

Also, the tooltip should display images when something like an "image" field is present.
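A minimal sketch of the image part, assuming each data point carries an "image" column holding a URL or base64 data URI, and `fig` is the explorer's figure; Bokeh's HoverTool accepts custom HTML tooltips:

    from bokeh.models import HoverTool

    hover = HoverTool(tooltips="""
        <div>
            <img src="@image" alt="" height="60">
            <div>@label</div>
        </div>
    """)
    fig.add_tools(hover)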

Unable to run the quick-start

I tried to run the quick-start code, but Jupyter shows nothing after I run:

from bokeh.io import show, output_notebook
output_notebook()
show(handle)

Did I miss something?

Do not re-plot just to update glyphs

When we plot the same source a second time,

  • the renderers of the first-time plot stick around,
  • the second-time renderers seem not to come with tooltips.

If the point is to update glyph attributes, do so by dynamically changing source.data, as sketched below.
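A sketch of the in-place alternative, assuming `source` backs the existing renderer (new_data_dict and changed_indices are placeholders):

    # full refresh: assigning .data triggers a re-draw, tooltips intact
    source.data = new_data_dict
    # or patch individual cells: {column: [(row_index, new_value), ...]}
    source.patch({"color": [(i, "red") for i in changed_indices]})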

Neural net training breaks when the dev set is empty

In hover.core.dataset:

    # edge case: valid slice is too small
    if df.shape[0] < 1:
        raise ValueError(f"Subset {key} has too few samples ({df.shape[0]})")
    batch_size = min(batch_size, df.shape[0])

Also, one needs to be careful with batch_size, since some neural net layers are incompatible with certain sizes; BatchNorm with batch_size=1 is a good example.
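An illustration of that pitfall in torch:

    import torch

    bn = torch.nn.BatchNorm1d(8)
    bn.train()
    bn(torch.randn(2, 8))  # fine
    bn(torch.randn(1, 8))  # ValueError: a single sample has no batch statistics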

Associating non_feature data with feature_key

There is often metadata associated with the feature data; for example, text comes from certain documents. After labeling the raw data it's often useful to merge the labels with the metadata for other data science tasks. For example, some sets of documents or locations might not contain labels that you would otherwise expect them to, or you may want to aggregate counts by document or location.

Is there a way for SupervisableTextDataset to include non_feature data? non_feature data could store this kind of metadata. The subset row order differs from the raw data frame, so you can't just match indices.
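In the meantime, a hedged workaround sketch: if the feature values are unique enough to serve as a join key, labels can be merged back onto a separate metadata frame after annotation. The dfs attribute and column names below are assumptions about the dataset's internals.

    import pandas as pd

    labeled = dataset.dfs["train"][["text", "label"]]
    merged = pd.merge(metadata_df, labeled, on="text", how="left")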

Improve search widget for image data

Feb 24 2022: currently using vector search.


Compared with text, how image search should work is rather unclear. Things to consider:

  • what kind of Bokeh widget to use. Could be FileInput for uploads or TextInput for urls
  • what kind of search is appropriate (as in, making semantic sense and trying to be independent from the vectorizer/dimensionality reduction)

Use of semi-supervised fit step for Umap and ivis

Hello,

Thank you so much for providing hover as an open source tool!

I was wondering if it would be possible to have the option of a semi-supervised fit with umap or ivis.
Indeed, from what I understand, both umap and ivis are fit in an unsupervised way: only the embedding information is used in the fit step.
For data belonging to the train and dev sets (public sets in the hover implementation), hover knows which class they belong to. As both umap and ivis support providing labels during the fit step (-1 where no label is known, for a semi-supervised fit), I was wondering if you have considered adding the class information to the fit step?
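A minimal sketch of the request with umap-learn, which treats y == -1 as "unlabeled" and performs a semi-supervised fit (`vectors` and `labels` are assumed inputs):

    import umap

    reducer = umap.UMAP(n_components=2)
    # labels: integer classes for train/dev points, -1 everywhere else
    embedding = reducer.fit_transform(vectors, y=labels)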

[todo] Turn demos into standard recipes

To reduce the amount of code that one needs to get (truly) started, we need carefully written recipes which take df/dictl + vectorizer arguments and handle sanity checks along with everything onward.

Getting GCC compile error for installation on MacOS

I am trying to install hover on macOS Monterey with a miniforge Python installation (Python 3.8). I get the following error message: distutils.errors.CompileError: command 'gcc' failed with exit status 1

Maybe a conda package would help with this?

Add a fill_alpha slider to BokehSoftLabelExplorer

Context: a fill_alpha of 0.6 vs. 0.8 can look almost the same. Right now fill_alpha is computed as fill_alpha = label_score, but we should consider the possibility of "zooming into" a user-specified range to provide better distinction between points.
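A hedged sketch of the widget, with hypothetical column names; a slider rescales the per-point alpha instead of binding it directly to label_score:

    from bokeh.models import Slider

    alpha_slider = Slider(start=0.0, end=1.0, value=1.0, step=0.05, title="alpha scale")

    def rescale_alpha(attr, old, new):
        scores = source.data["label_score"]
        source.data["fill_alpha"] = [s * new for s in scores]

    alpha_slider.on_change("value", rescale_alpha)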

Phase out soft label / denoising components

hover itself does not produce soft labels, hence cross_entropy_with_probs is only relevant when we do label smoothing. This can be achieved in torch as described here; it requires torch>=1.10.0.
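A sketch of the torch>=1.10 replacement:

    import torch

    # built-in label smoothing with hard integer targets,
    # removing the need for cross_entropy_with_probs
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
    loss = criterion(logits, target_classes)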

The co-teaching-based material in hover.utils.denoising is an overstretch here, requiring too much background for the vast majority of intended users. It's hard to justify embedding a specific piece of research in a library like hover that has almost no ties to it.

Accepting Huggingface transformers

Does hover include abstractions over text vectorizers? I wanted to use Hugging Face models, all-MiniLM-L6-v2 specifically, instead of spaCy for vectorizing text, and made a facade class; I thought I could contribute it, but I'm not sure.

Using Hugging Face would require sentence-transformers, and there might not be class abstractions over text (or other) vectorizers?
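For context, hover's recipes appear to take the vectorizer as a plain callable mapping a feature to a 1-D vector, so a facade over sentence-transformers can be quite thin — a sketch, not the contributed class:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def vectorizer(text):
        # encode() returns a 1-D numpy array for a single string
        return model.encode(text)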

Session not accessible on browser refresh

I accidentally hit back while labeling and the Bokeh ioloop was unable to retrieve the prior session,

[ioloop.py:760 - _run_callback()] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f672643b6d0>>, <Task finished name='Task-64563' coro=<ServerSession.with_document_locked() done, defined at /hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/server/session.py:78> exception=UnsetValueError("DataTable(id='1019', ...).view doesn't have a value set")>)
Traceback (most recent call last):
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/hover/utils/bokeh_helper.py", line 116, in load
    layout.children.append(bokeh_model)
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/core/property/wrappers.py", line 140, in wrapper
    self._notify_owners(old)
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/core/property/wrappers.py", line 169, in _notify_owners
    descriptor._notify_mutated(owner, old, hint=hint)
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/core/property/descriptors.py", line 596, in _notify_mutated
    self._set(obj, old, value, hint=hint)
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/core/property/descriptors.py", line 559, in _set
    self._trigger(obj, old, value, hint=hint, setter=setter)
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/core/property/descriptors.py", line 637, in _trigger
    obj.trigger(self.name, old, value, hint, setter)
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/model/model.py", line 564, in trigger
    self.document.models.invalidate()
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/document/models.py", line 184, in invalidate
    self.recompute()
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/document/models.py", line 204, in recompute
    new_models |= mr.references()
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/model/model.py", line 441, in references
    return set(collect_models(self))
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/model/util.py", line 159, in collect_models
    return collect_filtered_models(None, *input_values)
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/model/util.py", line 139, in collect_filtered_models
    visit_immediate_value_references(obj, queue_one)
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/model/util.py", line 205, in visit_immediate_value_references
    child = getattr(value, attr)
  File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/core/property/descriptors.py", line 234, in __get__
    raise UnsetValueError(f"{obj}.{self.name} doesn't have a value set")
bokeh.core.property.descriptors.UnsetValueError: DataTable(id='1019', ...).view doesn't have a value set

so I lost about an hour of labeling work. Fortunately, I exported the dataset minutes before this happened, and I hope I can load it as a SupervisableDataset and start where I left off.

EDIT - I was able to restore with SupervisableTextDataset.from_pandas, making sure that df[SUBSET] was not set to raw.
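A hypothetical sketch of that restore path; the "SUBSET" and "label" column names, the "ABSTAIN" sentinel, and the file name are assumptions about the exported schema:

    import pandas as pd
    from hover.core.dataset import SupervisableTextDataset

    df = pd.read_csv("exported_annotations.csv")
    # keep labeled rows out of the "raw" subset so their labels survive
    df.loc[df["label"] != "ABSTAIN", "SUBSET"] = "train"
    dataset = SupervisableTextDataset.from_pandas(df)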

hover configuration for constants

There are several constants that, if configurable, would allow the user to use hover more flexibly. For example,

  • There are several use cases involving more than 20 labels where the user would be okay with repeated colors; currently hover throws an assertion error: File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/hover/utils/bokeh_helper.py", line 22, in auto_label_color assert len(use_labels) <= 20, "Too many labels to support (max at 20)"
  • An earlier issue discusses customizing colors.

These and other configurability issues could be addressed by using Python's standard-library configparser. If the user changes the hover configuration, they would understand that the maintainers may not be able to help them.
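A minimal sketch of the configparser idea, with hypothetical file, section, and option names:

    import configparser

    config = configparser.ConfigParser()
    config.read("hover.ini")  # user-editable, at their own risk
    # lift the hard-coded 20-label cap if the user opts in
    MAX_LABELS = config.getint("visual", "max_labels", fallback=20)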

Rework rule-based labeling function mechanism

We need to address the interactivity of rules, i.e. the ability to create them dynamically without significant explorer overhead.

  • for example, adding just one rule for crosscheck should not require invoking a whole recipe again.
  • there's also no point in sticking to snorkel-like naming, since we don't go any further than applying rules.
    • consider dropping the snorkel dependency altogether, because the package is no longer actively maintained.
    • this also drops the assumption that "rules only deal with text data".

Below are some options, in increasing order of flexibility(?) and decreasing order of feasibility.

Option 1: (most feasible) in-Jupyter-only solution

Give the crosscheck recipe a reference to the list of rules, plus a callback that looks up the reference and runs the latest rules.
Because this is in Jupyter, a Python kernel and a coding area for the rules are readily available.
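A sketch of Option 1, with hypothetical names (apply_rule stands in for whatever applies a single rule to the dataset):

    RULES = []  # edit freely in other notebook cells

    def apply_latest_rules(dataset):
        # callback bound to a recipe button: re-reads the shared list,
        # so newly added rules take effect without re-invoking the recipe
        for rule in RULES:
            apply_rule(dataset, rule)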

Option N: (most flexible) in-any-server solution

Use widgets to provide the mechanism for adding/removing rules.
This will require a text input, a toggle list of current rules, and a remove button.
The tricky part will be dynamically and safely turning the text input into rules, and keeping track of them.

Reduce memory usage

A few goals:

  • determine if hover.recipes.subroutine uses any significant amount of excess memory.
  • fit demos into streamlit sharing's 800MB limit.

Clean up the dependency specifications / makefile

  • Put boilerplate configuration in tox and make the Makefile cleaner.
  • Remove (the need for) requirements-dev.txt and requirements-test.txt if possible.
  • Make sure to grep for anything (README.md, for example) that points to the to-be-removed files.

Add an FAQ page to the docs

There are some best practices that aren't part of the package but should be easy to find.

For example, how to save and restore annotation work (#47) can be the first FAQ.

Auto coloring is broken

Expected behavior: as one annotates a dataset and possibly adds classes,

  • the label -> color mapping should stay consistent as much as possible, and
  • the color of the data points must be consistent with the color of the legend.

Right now (in 0.4.0) both of these behaviors are broken.

[todo] Link color schemes between plots

When two or more plots base their color schemes on label classes, it would be much clearer if each class corresponded to the same color in every plot where it shows up.

Scalable Recipe Tests

Sketch / Food for Thought

Current testing procedure
SELECTING POINTS TO LABEL

  • make a selection in the explorer
  • [movable] enter in annotator_input
  • action_view_selection() -> check selection table
  • make sub-selection in selection table
  • action_evict_selection() -> check old/new selection
  • to be continued. Apply twice then commit? Commit twice then deduplicate? Repeated deduplication should have no effect? Neither should a repeated push? Check selection sync across multiple explorers at random points? Generate filter conditions and expected outcomes?

The state of a plotted recipe

  • dataset/explorer consistency
    • rows for each subset
      • recovered by PUSH
      • affected by COMMIT, DEDUP
      • unchanged by VIEW, EVICT, PATCH
    • data values (including SUBSET)
      • recovered by PUSH
      • unchanged by
  • dataset/selection table consistency: mutable data values
  • explorer/selection table consistency
    • selected indices
      • recovered(or reset) by VIEW, PUSH
      • unchanged by PATCH
      • affected by COMMIT, DEDUP, EVICT
    • data values (including SUBSET)
      • recovered(or reset) by VIEW, PUSH
      • unchanged by EVICT
      • affected by COMMIT, DEDUP, PATCH
  • widget states
    • checkbox toggles
    • input/slider values

Enable tooltips in selection table

In the selection table it can be difficult to evict data points when the entire text may be needed for labeling context. This is especially true when meaning derives from long-range dependence between phrases. Currently the table cuts off the text. Bokeh's DataTable depends strongly on SlickGrid, which does not easily allow row height to be dynamic. So, instead, to expose the full labeling text, I suggest that the text column be given a tooltip template, e.g. add an HTMLTemplateFormatter to "text" with

    from bokeh.models import HTMLTemplateFormatter

    if feature_key == "text":
        feature_col_kwargs["formatter"] = HTMLTemplateFormatter(
            template="""<span href="#" data-toggle="tooltip" title="<%= value %>"><%= value %></span>"""
        )

and if the text is too long, yet someone wants to inspect it without figuring out which of the many plotted points it is, they can just hover over the cell.

Add support for higher dimensionality than 2

As amazing as algorithms like UMAP are, they still lose a large amount of explained variance compared to embeddings of significantly higher dimensionality. This is why the dimensionality of word vectors, and of embeddings in general, keeps going up: it really does improve the downstream performance of the model.

I am not sure how refined Bokeh is with user-editable 3D plots, but it should be explored. I definitely implemented this sort of functionality (poorly) with Matplotlib on a 3D scatterplot back in 2018 for my own text project.

Besides being even cooler for users, this would make the ideal use cases of this tool (fast data labeling, effective active learning) far better. If 3D scatterplots won't work within Bokeh, consider implementing this as a large number of 2D scatterplots plotting various dimensions against each other; see the sketch below.
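A hedged sketch of that fallback, assuming a shared ColumnDataSource named `source` with embedding columns "d0".."d3" so selections stay linked across panels:

    from bokeh.io import show
    from bokeh.layouts import gridplot
    from bokeh.plotting import figure

    dims = ["d0", "d1", "d2", "d3"]
    plots = []
    for i, x in enumerate(dims):
        for y in dims[i + 1:]:
            p = figure(tools="lasso_select,pan,wheel_zoom,reset")
            p.circle(x=x, y=y, source=source)
            plots.append(p)
    show(gridplot(plots, ncols=3))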

Add support for image data

Prioritized feature.

Bokeh is obviously capable of including images in its HoverTool.

Hover's core classes assume no specific data format, but only the subclasses for text data are adequately implemented and tested so far.

Tutorials

The quick-start is a little too brief. Here's a sketch of a tutorial notebook (possibly split into multiple sections):

  1. simple_annotator in action
  2. under the hood: SupervisableDataset
  3. under the hood: BokehCorpusAnnotator
  4. subsets / "layers" and data management
  5. vectorizer alternatives & best practices (caching and persisting, for example)
  6. dimensionality reduction alternatives & resources
  7. the power of linking selections & ranges
  8. add-on: a better search thru BokehCorpusExplorer
  9. add-on: active learning help thru BokehSoftLabelExplorer
  10. add-on: distant supervision help thru BokehSnorkelExplorer
  11. add-on: A/B tests thru BokehMarginExplorer
  12. to be continued..

Add support for "drawing your own decision boundry" to implement machine teaching

I love this tool. I've been using Bokeh along with UMAP/Ivis/PCA and clustering for dataset visualization like this for a while, but I am happy to see someone automate this exact use case, since I've had to hand-roll this kind of tool for my own clustering / dimensionality reduction projects many times.

I think the logical extension to a tool like this is allowing someone to define their own decision boundary for a supervised model (this is called "machine teaching" rather than machine learning). Defining their own decision boundary should leave them with a supervised classifier at the end, plus the ability to visualize how that classifier operates (and ideally allowing an expert human to "tune" it). Note that this is different from the current "select aspects of the dataset by drawing" functionality built in.

One easy way to implement this is to allow the user to "draw" as before, but have them actually draw a "pseudo-subset" of their initial data (in fact creating new data). Fit the classifier on this "pseudo-subset"; it should end up training fast and giving the user some kind of "equation" (e.g. if you choose linear models) or some other interpretation mechanism (e.g. decision trees). When the expert changes bits of how this supervised model works, the model equation or interpretation should update. No need to do CV, since human eyeballs give you your regularization for you.

It's a lot of work, but I anticipate that if you implement it correctly you'd be well into the thousands of GitHub stars, because it's obvious in hindsight and is a huge win in situations where, say, a doctor may in fact be capable of "fixing" erroneous parts of a medical imaging AI's decision boundary.
