phurwicz / hover
:speedboat: Label data at scale. Fun and precision included.
Home Page: https://phurwicz.github.io/hover
License: MIT License
For the step "Ingredient 2 / 3: Embedding", I had to follow the instructions on spaCy's site to download the model... Maybe this is a case where "if you can't figure out how to get a spacy model, you shouldn't be using hover" ;-)
python -m spacy download en_core_web_md
Also, when running in NOTEBOOK MODE with Jupyter, this command didn't work:
show(interactive_plot, notebook_url='https://localhost:8888')
But I tried just
show(interactive_plot)
and had success, at least in showing the widgets right in the notebook! I noticed that they don't quite fit, but that might be another issue.
This is to document some thoughts on whether to allow vectorizers to output 2-D arrays, or even n-D arrays.

These classes/methods currently make use of vectorizers:
hover.core.dataset.SupervisableDataset.compute_2d_embedding()
hover.core.neural.VectorNet
hover.core.neural.MultiVectorNet

For SupervisableDataset.compute_2d_embedding(), it is unclear to the dimensionality reducers how to take advantage of 2-D localities.
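If 2-D vectorizer outputs were allowed, one straightforward (if locality-ignoring) option is to flatten each sample before handing it to a dimensionality reducer. A minimal numpy sketch; the array shapes are made up for illustration:

```python
import numpy as np

# hypothetical: 100 samples, each vectorized as an 8x16 2-D array
vectors_2d = np.random.rand(100, 8, 16)

# flatten each sample to 1-D so standard reducers (umap, ivis, ...) can consume it;
# note this discards the row/column locality discussed above
vectors_flat = vectors_2d.reshape(vectors_2d.shape[0], -1)

print(vectors_flat.shape)  # (100, 128)
```

Anything smarter than flattening would need the reducer itself to understand the 2-D structure, which is exactly the open question here.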
I probably overlooked it somewhere, but does Hover support multi-labeling? As a kludge I can create new labels with commas between them, but I thought I'd ask here.
Hi, I'm running the quickstart0 code.
Current behavior: after every selection (using the lasso or poly tool) I have to press the "View selected" button to update the table.
Desired (new) behavior: this button should not exist; the table should update automatically every time the selection is updated (be it lasso or poly tool).
I'll be happy to dig into the code if you point me in a general direction of where to look; I'm not sure how easy it is to write a hook for the bokeh plot.
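I haven't checked where hover wires up its tables, but in plain bokeh (server mode) the usual hook for this is the selected.indices property of the ColumnDataSource. A sketch, with all variable names hypothetical:

```python
from bokeh.models import ColumnDataSource

source = ColumnDataSource(data=dict(x=[1, 2, 3], y=[4, 5, 6]))

def on_selection_change(attr, old, new):
    # `new` is the list of currently selected row indices;
    # a table-refresh could be triggered here instead of by a button
    print(f"selection changed: {new}")

# fires on every lasso/poly/tap selection update
source.selected.on_change("indices", on_selection_change)
```

If hover's "View selected" button callback were registered this way, the table would track the selection with no extra click.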
Feels like this could be made very easy to demo/test if it were possible to run some of the core components (UI) as a Docker image, with the data loaded via an API.
The default gray for ABSTAIN labels is difficult to see on my screen. I wondered if there was an easy way to set custom color per label.
Requested by /u/svldsmnn on reddit: e.g. audio playback on mouse hovers or clicks (TapTool can be a good idea if the crosshair hits multiple points and can't decide which one to pick up).
For running audio playbacks, this gist can be helpful despite coming from the time of bokeh 0.12.
Also, there should be image display in the tooltip, should something like an "image" field be present.
I tried to run the quick-start code, but Jupyter shows nothing after I run:
from bokeh.io import show, output_notebook
output_notebook()
show(handle)
Did I miss something?
When we plot the same source a second time: if the point is to update glyph attributes, do so by dynamically changing source.data.
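For reference, the bokeh pattern for updating glyphs through their source rather than re-plotting looks roughly like this (all names hypothetical):

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

source = ColumnDataSource(data=dict(x=[1, 2, 3], y=[4, 5, 6]))
fig = figure()
fig.scatter(x="x", y="y", source=source)

# instead of plotting the same source a second time,
# replace the data dict; the existing glyphs re-render from it
source.data = dict(x=[1, 2, 3], y=[7, 8, 9])
```

This keeps one glyph renderer per source, which avoids duplicated glyphs stacking on top of each other.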
In hover.core.dataset:

    # edge case: valid slice is too small
    if df.shape[0] < 1:
        raise ValueError(f"Subset {key} has too few samples ({df.shape[0]})")
    batch_size = min(batch_size, df.shape[0])
Also, one needs to be careful with batch_size, since some neural net layers are incompatible with certain sizes. BatchNorm with batch_size=1 is a good example.
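To illustrate the BatchNorm pitfall: in training mode, torch's BatchNorm1d cannot compute batch statistics from a single sample and raises an error. A minimal sketch:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)
bn.train()  # training mode: batch statistics are computed from the batch

single = torch.randn(1, 4)  # batch_size=1
try:
    bn(single)
except ValueError as e:
    # a batch of one has no variance to estimate
    print(f"batch_size=1 fails in training mode: {e}")

# with batch_size >= 2 (or in eval mode) the same layer works fine
out = bn(torch.randn(2, 4))
print(out.shape)
```

Clamping batch_size to the subset size, as the snippet above does, can therefore still produce a degenerate batch of one on tiny subsets.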
There is often metadata associated with the feature data; for example, text comes from certain documents. After labeling the raw data, it's often useful to merge the labels with the metadata for other data science tasks. For example, some sets of documents or locations might not contain any labels that you would otherwise expect them to, or you may want to aggregate counts by document or location.
Is there a way for SupervisableTextDataset to include non_feature data? non_feature data could store this kind of metadata. The subset row order differs from the raw data frame, so you can't just match indices.
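One workaround, assuming the feature column survives labeling intact: merge the labels back onto the metadata by the feature value rather than by row index. This is plain pandas, not a hover API, and the column names are hypothetical:

```python
import pandas as pd

# labeled output, whose row order differs from the raw frame
labeled = pd.DataFrame({"text": ["good stuff", "bad stuff"],
                        "label": ["pos", "neg"]})

# original metadata keyed by the same feature column
meta = pd.DataFrame({"text": ["bad stuff", "good stuff"],
                     "document": ["doc2", "doc1"]})

# join on the feature value, so the differing row order doesn't matter
merged = meta.merge(labeled, on="text", how="left")
print(merged)
```

This breaks down if the feature values are not unique, which is why a proper non_feature passthrough would still be valuable.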
Feb 24 2022: currently using vector search.
Image search is rather unclear compared with text. Things to consider:
which Bokeh widget to use; could be FileInput for uploads or TextInput for URLs.

Hello,
Thank you so much for providing hover as an open source tool!
I was wondering if it would be possible to have the option to make a semi-supervised fit with umap or ivis.
Indeed, from what I understand, both umap and ivis are fit in an unsupervised way: only the embedding information is used in the fit step.
For data belonging to the train and dev sets (public sets from hover implementation), hover knows what class they belong to. As both umap and ivis support providing labels during the fit step (-1 if no label is known, to do semi-supervised fit), I was wondering if you considered adding the class information in the fit step?
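For reference, umap's semi-supervised convention marks unknown labels as -1. A sketch of building such a target array from the known train/dev labels; the umap call itself is left commented since it is the part being asked about, and all sizes are made up:

```python
import numpy as np

n_train, n_dev, n_raw = 100, 20, 500
train_labels = np.random.randint(0, 3, size=n_train)  # known classes
dev_labels = np.random.randint(0, 3, size=n_dev)      # known classes

# -1 marks points with no known class, per umap's semi-supervised convention
raw_labels = np.full(n_raw, -1)
y_semi = np.concatenate([train_labels, dev_labels, raw_labels])

# embedding = umap.UMAP().fit_transform(X, y=y_semi)  # X stacked in the same order
print(int((y_semi == -1).sum()))  # 500
```

The same y array shape would work for ivis, which also accepts partial labels at fit time.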
In this specific order:
Using a tooltips kwarg in recipes overrides the default soft label/score fields that BokehSoftLabelExplorer uses.
To reduce the amount of code that one needs to get (truly) started, we need carefully written recipes which take df/dictl + vectorizer arguments and handle sanity checks along with everything onward.
I am trying to install hover on macOS Monterey, which has the miniforge Python installation (Python 3.8). I get the following error message: distutils.errors.CompileError: command 'gcc' failed with exit status 1.
Maybe a conda package would help with this?
Context: a fill_alpha of 0.6 vs 0.8 can look almost the same. Right now fill_alpha is computed as fill_alpha = label_score, but we should consider the possibility of 'zooming into' a user-specified range to provide better distinction between points.
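A sketch of the 'zoom' idea: linearly rescale a user-specified score range onto a wider alpha range, clipping outside it. The range values below are made up:

```python
def score_to_alpha(score, score_min=0.5, score_max=0.9,
                   alpha_min=0.1, alpha_max=1.0):
    """Map scores in [score_min, score_max] to alphas in [alpha_min, alpha_max]."""
    t = (score - score_min) / (score_max - score_min)
    t = min(max(t, 0.0), 1.0)  # clip scores outside the zoomed range
    return alpha_min + t * (alpha_max - alpha_min)

# 0.6 vs 0.8 now differ by 0.45 in alpha instead of 0.2
print(score_to_alpha(0.6), score_to_alpha(0.8))
```

With fill_alpha = label_score the usable contrast is whatever spread the scores happen to have; rescaling spends the full alpha range on the interval the user actually cares about.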
hover itself does not produce soft labels, hence cross_entropy_with_probs is only relevant when we do label smoothing. This can be achieved in torch as described here. It requires torch>=1.10.0.
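For reference, the torch>=1.10 route is the label_smoothing argument on CrossEntropyLoss, which mixes uniform probability mass into the hard targets without needing a separate soft-label loss. A minimal sketch:

```python
import torch
import torch.nn as nn

# label_smoothing was added to CrossEntropyLoss in torch 1.10
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 3)            # batch of 8, 3 classes
targets = torch.randint(0, 3, (8,))   # hard integer labels

# equivalent to training against targets with 0.1 uniform mass mixed in
loss = criterion(logits, targets)
print(loss.item())
```

This makes a dedicated cross_entropy_with_probs helper unnecessary for the smoothing use case.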
Co-teaching-based stuff in hover.utils.denoising is an over-stretch here, with too much background for the vast majority of intended users. It's hard to justify using a specific piece of research in a library like hover with almost no ties to it.
Does Hover include abstractions over text vectorizers? I wanted to use huggingface models, all-MiniLM-L6-v2 specifically, over spaCy for vectorizing text, and made a facade class; I thought I could contribute it but I'm not sure.
Using huggingface would require sentence-transformers, and there might not be class abstractions over text (or otherwise) vectorizers?
I accidentally hit back while labeling and the Bokeh ioloop was unable to retrieve the prior session:
[ioloop.py:760 - _run_callback()] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f672643b6d0>>, <Task finished name='Task-64563' coro=<ServerSession.with_document_locked() done, defined at /hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/server/session.py:78> exception=UnsetValueError("DataTable(id='1019', ...).view doesn't have a value set")>)
Traceback (most recent call last):
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/hover/utils/bokeh_helper.py", line 116, in load
layout.children.append(bokeh_model)
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/core/property/wrappers.py", line 140, in wrapper
self._notify_owners(old)
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/core/property/wrappers.py", line 169, in _notify_owners
descriptor._notify_mutated(owner, old, hint=hint)
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/core/property/descriptors.py", line 596, in _notify_mutated
self._set(obj, old, value, hint=hint)
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/core/property/descriptors.py", line 559, in _set
self._trigger(obj, old, value, hint=hint, setter=setter)
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/core/property/descriptors.py", line 637, in _trigger
obj.trigger(self.name, old, value, hint, setter)
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/model/model.py", line 564, in trigger
self.document.models.invalidate()
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/document/models.py", line 184, in invalidate
self.recompute()
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/document/models.py", line 204, in recompute
new_models |= mr.references()
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/model/model.py", line 441, in references
return set(collect_models(self))
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/model/util.py", line 159, in collect_models
return collect_filtered_models(None, *input_values)
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/model/util.py", line 139, in collect_filtered_models
visit_immediate_value_references(obj, queue_one)
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/model/util.py", line 205, in visit_immediate_value_references
child = getattr(value, attr)
File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/bokeh/core/property/descriptors.py", line 234, in __get__
raise UnsetValueError(f"{obj}.{self.name} doesn't have a value set")
bokeh.core.property.descriptors.UnsetValueError: DataTable(id='1019', ...).view doesn't have a value set
so I lost about an hour of labeling work. Fortunately, I exported the dataset minutes before this happened, and I hope I can load it as a SupervisableDataset and start where I left off.
EDIT - I was able to restore with SupervisableTextDataset.from_pandas, making sure that df[SUBSET] was not set to raw.
There are several constants that, if configurable, would allow the user to use Hover more flexibly. For example:

    File "/hdd/work/anaconda3/envs/qca/lib/python3.10/site-packages/hover/utils/bokeh_helper.py", line 22, in auto_label_color
        assert len(use_labels) <= 20, "Too many labels to support (max at 20)"

These and other configurability issues could be addressed by using Python's standard library configparser. If the user changes the Hover configuration, they would understand that the maintainers may not be able to help them.
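A sketch of what configparser-based configurability could look like. The section and option names here are hypothetical, not hover's actual config schema:

```python
import configparser

# hypothetical hover.ini that a user could edit at their own risk
config = configparser.ConfigParser()
config.read_string("""
[visual]
max_labels = 40
abstain_color = #404040
""")

# the hard-coded `assert len(use_labels) <= 20` could read its bound from here
max_labels = config.getint("visual", "max_labels")
assert max_labels > 0, "max_labels must be positive"
print(max_labels)  # 40
```

In practice the file would be read once at import time and the values cached as module-level constants, keeping the call sites unchanged.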
In this specific order:
We need to address the interactivity of rules, i.e. the ability to create them dynamically without significant explorer overhead.
snorkel-like, since we don't go any further than applying rules.
Remove the snorkel dependency altogether, because the package is no longer actively maintained.

Below, let's list some options in increasing order of flexibility(?) and decreasing order of feasibility.
Give a reference (to the list of rules) to the crosscheck recipe, which has a callback that looks up the reference and runs the latest rules.
Because this is in Jupyter, the Python kernel and a coding area for the rules are easily available.
Use some widgets to provide the mechanism of adding/removing rules.
This will require a text input, a toggle list of current rules, and a remove button.
The tricky part will be dynamically turning the text input safely into rules and keeping track of them.
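For the tricky part, one safe-ish approach is to restrict the text input to a tiny grammar rather than eval-ing arbitrary code: e.g. 'label: keyword1, keyword2' compiled into a closure. All names here are hypothetical:

```python
def compile_keyword_rule(spec):
    """Parse 'label: kw1, kw2' into a function mapping text -> label or None."""
    label, _, keywords = spec.partition(":")
    label = label.strip()
    keywords = [k.strip().lower() for k in keywords.split(",") if k.strip()]

    def rule(text):
        lowered = text.lower()
        return label if any(k in lowered for k in keywords) else None

    rule.__name__ = f"rule_{label}"  # so a toggle list of rules can display it
    return rule

rule = compile_keyword_rule("positive: great, awesome")
print(rule("This is awesome!"))  # positive
print(rule("meh"))               # None
```

Keeping track of them is then just a dict from rule name to closure, which the remove button can pop from.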
A few goals:
Make sure nothing in hover.recipes.subroutine uses any significant amount of excess memory, given streamlit sharing's 800MB limit.
Use tox and make the Makefile cleaner.
Remove requirements-dev.txt and requirements-test.txt if possible.
grep for anything (README.md, for example) that points to the to-be-removed files.

There are some best practices that aren't part of the package but should be easy to find.
For example, how to save and restore annotation work (#47) can be the first FAQ.
Expected behavior: as one annotates a dataset and possibly adds classes,
Right now (in 0.4.0) both of these behaviors are broken.
When two or more plots base their color schemes on label classes, it would be much clearer if each class corresponds to the same color in all the plots where it shows up.
Current testing procedure
SELECTING POINTS TO LABEL
The state of a plotted recipe
In the selection table, it can be difficult to evict data points when the entire text may be needed for labeling context. This is especially true for meaning derived from long-range dependencies between phrases. Currently that table cuts off the text. Bokeh's DataTable depends strongly on SlickGrid, which does not easily allow row height to be dynamic. So, instead, to get the full labeling text, I suggest that the text column be set with a tooltip template, e.g. add an HTMLTemplateFormatter to "text" with

    from bokeh.models import HTMLTemplateFormatter

    if feature_key == "text":
        feature_col_kwargs["formatter"] = HTMLTemplateFormatter(
            template="""<span href="#" data-toggle="tooltip" title="<%= value %>"><%= value %></span>"""
        )

and if the text is too long yet someone wants to inspect it without figuring out which of the many points it is, they can just hover over it.
As we all know, as amazing as algorithms like UMAP are, they still miss massive amounts of explained variance compared to significantly higher levels of dimensionality. This is why the dimensionality of "word vectors", or of embeddings in general, continues to go up... it really does improve the downstream performance of the model.
I am not sure how refined Bokeh is with user-editable 3D plots, but it should be explored. I definitely implemented this sort of functionality (poorly) with Matplotlib on a 3D scatterplot back in 2018 for my own text project.
Besides being even cooler for users, this would make the ideal use cases of this tool (fast data labeling, effective active learning) far, far better. Consider implementing this as a large number of 2D scatterplots plotting various dimensions against each other if 3D scatterplots won't work within Bokeh.
Prioritized feature.
Bokeh is obviously capable of including images in its HoverTool.
Hover's core classes assumed no specific data format, but only the subclasses for text data are adequately implemented and tested so far.
The quick-start is a little too brief. Here's a sketch of a tutorial notebook (possibly split into multiple sections):
simple_annotator in action
SupervisableDataset
BokehCorpusAnnotator
BokehCorpusExplorer
BokehSoftLabelExplorer
BokehSnorkelExplorer
BokehMarginExplorer
I love this tool. I've been using Bokeh along with UMAP/Ivis/PCA and clustering for dataset visualization like this for a while, but I am happy to see someone automate this exact use case, since I've had to hand-roll this kind of tool for my own clustering / dimensionality reduction projects many times.
I think the logical extension to a tool like this is allowing someone to define their own decision boundary of a supervised model (they call this "machine teaching" rather than machine learning). Defining their own decision boundary should end up with them having a supervised classifier at the end and being able to visualize how that classifier operates (and ideally allowing an expert human to "tune" it). Note that this is different than the current "select aspects of the dataset by drawing" functionality built in.
One easy way to implement this is to allow the user to "draw" like you do earlier, but then making it so the user is actually drawing a "pseudo-subset" (but is actually creating new data) of their initial data. Fit the classifier on this "pseudo-subset", and it should end up training fast and giving the user some kind of "equation" (e.g. if you choose linear models) or some other interpretation mechanism (e.g. decision trees). When the expert changes bits of how this supervised model works, the model equation or interpretation should update. No need to do CV, since it's human eyeballs giving you your regularization for you.
It's a lot of work, but I anticipate that if you implement it correctly you'd be well into the thousands of GitHub stars, because it's fking obvious but is a huge win in situations where, say, a doctor may in fact be capable of "fixing" erroneous parts of a medical imaging AI's decision boundary.