
code-kern-ai / refinery

1.4K stars · 16 watchers · 64 forks · 3.62 MB

The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.

Home Page: https://www.kern.ai

License: Apache License 2.0

Languages: Python 67.82%, Shell 12.67%, Batchfile 19.51%
Topics: annotations, data-centric-ai, data-labeling, deep-learning, labeling, labeling-tool, machine-learning, natural-language-processing, neural-search, nlp

refinery's People

Contributors

derkernigefeuerpfeil, felixkirsch, felixkirschkern, jhoetter, jwittmeyer, lumburovskalina, simondegrafkern, sirdegraf, tianzhou, wiertzbl


refinery's Issues

List multiple weak supervision algorithms

Is your feature request related to a problem? Please describe.
Weak supervision is not just one specific algorithm; you can actually choose from a whole set of formulas. Let users decide which algorithm they want to pick :)

Describe the solution you'd like
General SGD variant, Triplet-based, etc.

Also, for the sigmoid confidence estimation, enable users to provide the parameters c and k (see the weak-nlp library).
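
As a rough illustration (a minimal sketch; the parameter names c and k follow the issue text, not a confirmed weak-nlp API), such a parameterized sigmoid could look like this:

import math

def sigmoid_confidence(score, c=0.5, k=10.0):
    # maps a raw heuristic score to a confidence in (0, 1);
    # c shifts the midpoint, k controls the steepness of the curve
    return 1.0 / (1.0 + math.exp(-k * (score - c)))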

Describe alternatives you've considered
Users could have a programmatic interface with template-injected code for weak supervision. But that is only needed for heavy customization and most likely not required at the moment.

Additional context
-

Zero-shot extraction

Is your feature request related to a problem? Please describe.
Currently, refinery only supports zero-shot classification.

Describe the solution you'd like
Embed zero-shot models for extraction tasks from Hugging Face.

Describe alternatives you've considered
Zero-shot could also be implemented without Hugging Face, i.e. only using the embedded documents/tokens and calculating the nearest label item. That would arguably be faster, but likely worse in prediction quality.

Additional context
-

Attribute calculation

Is your feature request related to a problem? Please describe.
Similar to pandas, I want to be able to create new attributes given some logic to apply.

Describe the solution you'd like
This could look very similar to labeling functions, e.g. as follows:

def cat_attributes(record):
    # concatenate two existing textual attributes into a new one
    return str(record["attr1"]) + str(record["attr2"])

to concatenate two textual attributes.

To avoid data clutter, users should also be able to a) distinguish original attributes from calculated ones and b) delete calculated attributes.

Describe alternatives you've considered
Changing the data before uploading it; very static and not dev-friendly.

Additional context
I already implemented a proof-of-concept before our OS release :)

Edit a single script to change all of the heuristics

Is your feature request related to a problem? Please describe.
I have multiple heuristics that all share the same initial check (not " no " in record["text"].text.lower()). If the condition changes, I have to change all heuristics.

Describe the solution you’d like
Have something like a pipeline (one script that runs prior to every other script) so I only need to change the condition in one place.
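
A minimal sketch of the idea (function names and record structure are illustrative, reusing the check from above):

def passes_precondition(record):
    # the shared check lives in exactly one place
    return " no " not in record["text"].text.lower()

def some_heuristic(record):
    if not passes_precondition(record):
        return None  # abstain
    return "positive"

Every heuristic would then start by calling the shared function, so a change to the condition propagates automatically.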

Describe alternatives you’ve considered
Using a Dataslice as a “whitelist” for the function might also do the trick

Additional context
Requested by GeorgePearse on Discord

Parallel/Distributed computation of heuristic chunks

Is your feature request related to a problem? Please describe.
For projects with heuristic runtimes of more than a minute (rough estimate), parallelization should be an option to reduce the waiting time for heuristics.

Describe the solution you'd like
Labeling functions are designed to be independent record-wise (i.e. record 1 is independent of record 2), so the data can be split into N chunks and the computation parallelized.
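
A rough sketch of the chunking idea with the standard library (the labeling function is a toy stand-in; refinery's actual execution model may differ):

from concurrent.futures import ProcessPoolExecutor
from itertools import chain

def my_labeling_function(record):
    # toy labeling function; abstains (None) unless a keyword matches
    return "positive" if "great" in record["text"].lower() else None

def run_chunk(chunk):
    return [my_labeling_function(record) for record in chunk]

def run_parallel(records, n_chunks=4):
    size = -(-len(records) // n_chunks)  # ceiling division
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ProcessPoolExecutor(max_workers=n_chunks) as executor:
        results = list(executor.map(run_chunk, chunks))
    # chunks are contiguous, so concatenating preserves record order
    return list(chain.from_iterable(results))

if __name__ == "__main__":
    data = [{"text": "great tool"}, {"text": "meh"}] * 1000
    print(run_parallel(data)[:4])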

Describe alternatives you've considered
-

Additional context
-

Add CLI check for newer main repo version

Is your feature request related to a problem? Please describe.
Currently, the CLI tool pulls the repository only if the folder doesn't exist. This can result in outdated versions of the start script.

Describe the solution you'd like
Check whether a newer version of the repository is available (e.g. by saving a commit hash).
When running a CLI command, check whether a newer version is available and prompt an update request (yes/no).
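
Roughly, with GitPython (a sketch; the repository path and branch handling would need to match the actual start script):

from git import Repo

def update_available(repo_path):
    repo = Repo(repo_path)
    repo.remotes.origin.fetch()
    local_sha = repo.head.commit.hexsha
    remote_sha = repo.remotes.origin.refs[repo.active_branch.name].commit.hexsha
    # differing hashes suggest the local checkout is not up to date
    return local_sha != remote_sha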

Describe alternatives you've considered

  1. Always pull the newest version on start. This might force an update on someone who doesn't want one, so it may not be optimal.
  2. Create an update command and only check on execution of that command; this might result in someone missing a crucial update.

Additional context
Feature wish arose from the question of updating "existing" users (in relation to an upcoming update related to MinIO & Qdrant storage (#27)).

Heuristic disagreement

Is your feature request related to a problem? Please describe.
It would be good to see where heuristics disagree (and possibly how strongly).

Describe the solution you’d like
A way to filter for disagreeing heuristics. Maybe include a way to filter by the amount of disagreement. E.g. if two active learning modules are confident (>90%) but disagree, that is a more important case than two that disagree at <20% confidence.
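
A sketch of the kind of filter this could enable (purely illustrative; it assumes each heuristic returns a (label, confidence) pair and never abstains):

def confident_disagreements(records, heuristic_a, heuristic_b, min_confidence=0.9):
    hits = []
    for record in records:
        label_a, conf_a = heuristic_a(record)
        label_b, conf_b = heuristic_b(record)
        # keep records where both modules are confident yet disagree
        if label_a != label_b and min(conf_a, conf_b) >= min_confidence:
            hits.append(record)
    return hits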

Describe alternatives you’ve considered
-

Additional context
Requested by GeorgePearse on Discord

For active learning: a way to view the instances for which different models most strongly disagree.
Disagreement between a heuristic and a model is a great, simple way to get good active learning results. So I'd want to be able to look at (and select, to view more closely and label) disagreement between any heuristics and any models.

Lookup list recommendations

Is your feature request related to a problem? Please describe.
Lookup lists offer great value, but might still be cumbersome to fill. I want to have this (semi-)automated.

Describe the solution you'd like
Automatically infer new lookup list items as suggestions from the existing terms of a given list.

Describe alternatives you've considered
-

Additional context
Might be done via span-labeling active learning or by integrating external resources.

[BUG] - `refinery start` leads to gitpython error

Describe the bug
refinery start leads to this error, right after pip installing 🥺:

Traceback (most recent call last):
  File "/Users/adrien/miniconda3/bin/refinery", line 5, in <module>
    from refinery.cli import main
  File "/Users/adrien/miniconda3/lib/python3.8/site-packages/refinery/cli.py", line 6, in <module>
    from git import Repo
  File "/Users/adrien/miniconda3/lib/python3.8/site-packages/git/__init__.py", line 6, in <module>
    from repository import Repository, InvalidRepositoryError
ModuleNotFoundError: No module named 'repository'

Desktop (please complete the following information):

  • OS: macOS Monterey 12.4

Constraints for label management

Is your feature request related to a problem? Please describe.
I have a lot of label options in different labeling tasks. If I select one, the remaining options for other tasks are drastically reduced; e.g. if I choose "outerwear" as "category", I will have a specific set of options for "subcategory", as in a tree.

Describe the solution you'd like
Label constraints, e.g. via label hierarchies. For instance, if the l1-task label is "positive", the l2-task labels can only be "joyful", "thankful" etc.
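
As a rough illustration of the underlying data structure (the "positive" example is from above; the other names are hypothetical):

LABEL_CONSTRAINTS = {
    "positive": ["joyful", "thankful"],
    "negative": ["angry", "disappointed"],
}

def allowed_l2_labels(l1_label):
    # an empty list means no l2-label may be set for this l1-label
    return LABEL_CONSTRAINTS.get(l1_label, [])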

Describe alternatives you've considered
-

Additional context
Might be altered a bit to also improve data quality by finding records that don't fit the label constraints.

Writing weak supervision results into accessible metadata

Is your feature request related to a problem? Please describe.
I have e.g. some tweets that I want to clean for my model, and I have built a weak supervision procedure to tag the links in them automatically. Still, I can't get rid of these links inside the application.

Describe the solution you'd like
Feed the weakly supervised data back into the record data. For instance, if I know that "Check out this tool! https://github.com/code-kern-ai/refinery" contains the link "https://github.com/code-kern-ai/refinery", write that metadata into the record itself:

{
  "tweet": "Check out this tool! https://github.com/code-kern-ai/refinery",
  "tweet__entities": {
    "link": [5, 6]
  }
}

Describe alternatives you've considered
Doing that modification outside of the application, e.g. in a Jupyter notebook. However, this makes iteration harder, and it is currently not possible to update the attributes themselves.

Additional context
related to #40

[BUG] - Can't run `refinery_start` outside of refinery repository

Describe the bug
I have installed the package via pip install kern-refinery and now want to run the server via refinery start. This only works if I'm inside the cloned repository; however, I want to do this from any path once I've installed the library.

To Reproduce
Steps to reproduce the behavior:

  1. Run pip install kern-refinery
  2. Go to a directory that doesn't have refinery as its child
  3. Run refinery start in your CLI
  4. See error

Expected behavior
I'd expect this to do the same as if I'd run refinery start from inside the git repository.

Desktop (please complete the following information):

  • OS: macOS Monterey 12.1
  • Browser: Chrome

Create heuristic (labeling function) from data browser filter

Is your feature request related to a problem? Please describe.
After exploring the data and finding a good filter, transferring it to a heuristic is tedious or impossible.

Describe the solution you’d like
Two versions:

  1. Lite variant: auto-generate e.g. a regex matcher for the current attribute filter.
  2. Advanced (possibly in a later version): a full matcher including the label filter etc. This would require that additional data (e.g. label data) is made available to the labeling functions.

Describe alternatives you’ve considered
-

Additional context
Requested by GeorgePearse on Discord

[BUG] - Minio & qdrant storage is not persistent

Describe the bug
In the current Docker template file, the object storage (MinIO) and Qdrant have no volumes attached. This means their data is not saved persistently.

To Reproduce
Steps to reproduce the behavior:

  1. Create a project
  2. Let the tokenization finish
  3. Run a labeling function (docbin_full is used)
  4. Use the stop script to shut down the app
  5. Use the start script to start the app
  6. Run the function again
  7. See the error

Expected behavior
The labeling function should run again without an error.

Desktop (please complete the following information):

  • OS: all
  • Browser: all
  • Version: 1.0.0

[BUG] - Problems while uploading files

Hi everyone. I'm afraid I'm experiencing the same error as this one: once refinery is launched, while creating a new project in the GUI, the file upload functionality is disabled (clicking doesn't let me select any files). I'm not sure if this is the same problem, since I'm working on a Mac and the IP address seems to be gathered correctly. I'm also using the self-hosted version.

Any thoughts on this? Thanks in advance.

Upload into data storage

Is your feature request related to a problem? Please describe.
I have multiple files which I want to combine, e.g. source_a and source_b. Or I want to modify data before I load it into a project; generally, I want to be able to program what I give as input.

Describe the solution you'd like
Uploaded files should be stored in some data storage and be accessible programmatically. For instance, if I want to label duplicates in my data, I want to be able to loop over the rows and compare their embeddings, so that only interesting potential duplicate rows are inserted into my project.
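
For the duplicate example, the loop could boil down to a cosine-similarity screen over precomputed embeddings (a sketch; the threshold and the embedding source are assumptions):

import numpy as np

def near_duplicate_pairs(embeddings, threshold=0.95):
    # cosine similarity between all row pairs
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = normed @ normed.T
    np.fill_diagonal(similarities, 0.0)
    # index pairs of rows that look like near-duplicates
    return np.argwhere(similarities > threshold)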

Describe alternatives you've considered
Implementing that workflow outside of the app and then inserting the data

Additional context
e.g. interesting to build training data for encoders that help to detect duplicates in my data

Multilabel classification

Is your feature request related to a problem? Please describe.
I want to label two genres (labels) in one labeling task for a review, e.g. "Horror" and "Action".

Describe the solution you'd like
Choose multiple labels per task on one record for classification tasks. Includes ways to manage weak supervision, statistics/visualizations, and export functions

Describe alternatives you've considered
Current workflow, but one labeling task per binary label - lots of overhead for the users

Additional context
-

Manage weak supervision runs

Is your feature request related to a problem? Please describe.
If I create new heuristics or improve my active learner, I want to version my weakly supervised labels, as I fear they might be worse than in my previous run.

Describe the solution you'd like
Enable multiple weak supervision versions (i.e. packages)

Describe alternatives you've considered
Downloading and versioning the data with e.g. DVC, but that becomes cumbersome and slows down my workflow of experimenting with the data.

Additional context
-

Simple label management

Is your feature request related to a problem? Please describe.
I want to change my label names (e.g. to fix spelling), potentially merge two labels, and add label descriptions.

Describe the solution you'd like
Renaming and merging labels, adding label descriptions. Could also include a description for the labeling task, if you want to provide annotators with added information

Describe alternatives you've considered
-

Additional context
-

S3 import option

Is your feature request related to a problem? Please describe.
I think that is clear :-)

Describe the solution you'd like
Automatically pull data from S3

Describe alternatives you've considered
-

Additional context
@SirDegraf this is not urgent (see project roadmap), but since you're the one with most S3 experience, I'll already assign this to you

High-level data quality estimation metric

Is your feature request related to a problem? Please describe.
I want a general metric to estimate how "good" my data quality is (potentially from different angles).

Describe the solution you'd like
A confident-learning-based approach to estimate the manual error rate from weakly supervised labels and manual labels.

Describe alternatives you've considered

  • F1 score: weakly supervised vs. manually labeled
  • Accuracy: weakly supervised vs. manually labeled

Additional context
cleanlab can be used to easily prototype this.
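
For instance, with cleanlab 2.x (a sketch; the label arrays are placeholders, and the probabilities would come from the weak supervision run):

import numpy as np
from cleanlab.filter import find_label_issues

manual_labels = np.array([0, 1, 1, 0])  # placeholder manual labels
weak_label_probs = np.array(
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
)  # placeholder class probabilities from weak supervision

issue_indices = find_label_issues(
    labels=manual_labels,
    pred_probs=weak_label_probs,
    return_indices_ranked_by="self_confidence",
)
estimated_error_rate = len(issue_indices) / len(manual_labels)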

Nested span labeling

Is your feature request related to a problem? Please describe.
I have one attribute in which I want to apply multiple labels that might overlap. For instance, in "It has a great OLED screen", I want to label "great OLED screen" as positive and "OLED screen" as characteristic (just as an example; you could think of many better ones).

Describe the solution you'd like
Choose multiple span labels per task on one text for extraction tasks. Includes ways to manage weak supervision, statistics/visualizations, and export features.

Describe alternatives you've considered
Copying the text attributes; again, just a very basic workaround that is not dev-friendly.

Additional context
-

Re-visit and mark records (e.g. uncertainty)

Is your feature request related to a problem? Please describe.
At the moment, the only way to tell whether a user has already looked at a record is whether the record is labeled.
E.g. if a record has no span labels and we only have a span labeling task, we cannot tell whether a user has already looked at it.

Describe the solution you'd like
Marking records where annotators are uncertain about the label (metadata that can be filtered and taken into account for statistics), and marking records that should be re-visited.
This way, we can assume that all labels of a record have been set as soon as it has been visited (relevant for extraction-task heuristics).

Describe alternatives you've considered
Automatically marking which records have been visited by a user.

Additional context
-

Extensive user management

Is your feature request related to a problem? Please describe.
I have a complex problem which I want to tackle via multiple annotators or engineers.

Describe the solution you'd like
Invite users to your organization from inside the application to join the labeling team and grant different roles (e.g. engineer or annotator).

Describe alternatives you've considered
-

Additional context
Will be a premium feature for the managed version.

[BUG] - Import Options aren't transferred correctly

Describe the bug
File import options are not interpreted/transferred to the backend.

To Reproduce
Steps to reproduce the behavior:

  1. Have a file that needs specified import options (e.g. the one attached with sep=;)
  2. Click on New Project
  3. Add File and import Options
  4. Upload Data (Proceed)
  5. Columns aren't split and are displayed as one

Expected behavior
Columns are split into two


Desktop (please complete the following information):

  • OS: all
  • Browser: all
  • Version: 1.0.0

Additional context
An example file with sep=; is needed:
csvTest_semicolon.zip

Accept weakly supervised labels as manual labels

Is your feature request related to a problem? Please describe.
As soon as there are weakly supervised labels, one could label much faster by switching to a "confirmation mode" instead of selecting the label oneself (i.e. going from a multiclass task to a binary task).

Describe the solution you'd like
Option to accept - per task level - weakly supervised labels as manual labels

Describe alternatives you've considered
-

Additional context
-

Option to create lookup lists from classification tasks

Is your feature request related to a problem? Please describe.
It is cumbersome and unintuitive to create a span-labeling task label just to collect lookup list values when all I want is a classification task. I want to be able to easily create lookup lists by creating only a classification task.

Describe the solution you'd like
Mirror an extraction labeling task to create lookup lists when there is only a classification task actively used. This could be an option to select when creating the labeling task.

Describe alternatives you've considered
Manually do so without automation (e.g. copy paste every label)

Additional context
Also, it would be helpful if there is an option to automatically set the classification task label if I manually set a span label.

Derive labeling function from lookup list

Is your feature request related to a problem? Please describe.
It is cumbersome to write a labeling function that always has the exact same structure whenever I want to use a lookup list in my labeling functions.

Describe the solution you'd like
Automatically create a labeling function given some lookup list (e.g. by asking for the attribute to apply this to, or by linking a lookup list to attributes)
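
The generated function could mirror the structure these functions typically share (a sketch; the list name, label, and attribute are placeholders, and the knowledge import reflects how refinery's labeling functions usually access lookup lists — an assumption here):

from knowledge import product_terms  # placeholder lookup list name

def lkp_product_terms(record):
    text = record["text"].text.lower()
    for term in product_terms:
        if term.lower() in text:
            return "mentions_product"  # placeholder label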

Describe alternatives you've considered
Manually doing so

Additional context
-

Classification weak learners should allow more than one input feature

Hi team,

I decided to give Refinery a try with a classification problem where there is more than one input feature, and the idea is to classify their combination into a few categories.

To give an example of a similar problem, imagine an oxymoron classification task with 2 input features: word_a and word_b, and a binary class output: is_oxymoron and not_oxymoron.

The problem I have is that the two features, or their embeddings, are useless in isolation; it's their interaction that counts. But in all weak learner possibilities, I apparently must choose feature a or feature b; I can't use both. Am I misunderstanding something? This could be something I don't understand in the UI.

Also, I would expect to be able to transform the input data with my own functions and use that as input as well; although not ideal, this could be used to work around the limitation of one input feature per learner.

Otherwise, it looks good and the UI is rather well-organised.

Further export options

Is your feature request related to a problem? Please describe.
JSON is not always the desired format for exports, so there should be multiple options to select from.

Describe the solution you'd like
Provide multiple options (e.g. spreadsheet, CSV etc.) for the data export, and enable integrations to e.g. S3

Describe alternatives you've considered
Python SDK export options; the SDK export options and UI export options should be kept in sync.

Additional context
-

Managed version workflow test into production

Is your feature request related to a problem? Please describe.
There currently is one production system for our managed version, which is being used by trial users and paying clients.

Describe the solution you'd like
Trial users should get access to a dedicated trial system; once they transfer to the production system, their data should be transferred, e.g. via project snapshot export/import. The same is true for their user IDs.

Describe alternatives you've considered
-

Additional context
Only relevant for the managed version, not for the OS version

Data management presets

Is your feature request related to a problem? Please describe.
-

Describe the solution you'd like

  • "records with no labels in random order"
  • "disagreeing heuristics" (see issue #20 )
  • "mismatch weakly supervised and manually labeled"

Describe alternatives you've considered
-

Additional context
-

Custom and finetuned embedding computation

Is your feature request related to a problem? Please describe.
The creation of embeddings can range from straightforward to highly customized. Similar to labeling functions, the creation of embeddings should have a flexible interface.

Describe the solution you'd like
Provide a programmatic interface for embedding calculation to enable users to build custom embeddings, and to fine-tune the models with labeled data. For instance (roughly):

from embedders.classification.contextual import TransformerSentenceEmbedder

def classification_word_a_cat_word_b_distilbert(record):
    # rough sketch: fit the transformer embedder on labeled data to fine-tune it
    embedder = TransformerSentenceEmbedder("distilbert-base-cased")
    return embedder.fit_transform(record["word_a_cat_word_b"], record["is_oxymoron"])

Describe alternatives you've considered
See issue #24, which proposes an option to upload custom embeddings.

Additional context
Related to issue #24, but this one concerns in-app actions.

[BUG] - Lookup export list does not match the import type

Describe the bug
If the list of terms is downloaded, the same list cannot be uploaded again because the export and import types differ.

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'lookup-lists'
  2. Click on 'Download list'
  3. Click on 'Upload terms'
  4. Attach the file
  5. See the error: 'Import of knowledge base failed'

Expected behavior
When users download a list of terms, they should be able to upload it again.

Screenshots
-

Desktop (please complete the following information):

  • OS: macOS
  • Browser: Chrome

Additional context
-

File upload for project creation not working

Hello,

I am using the self-hosted version (v. 1.0). However, the file upload does not work when I try to create a new project. I don't get any error in the GUI; the screen just stays the same. When I press "Proceed" a second time, it tells me "Project title exists", and the project does indeed exist, but without any records.

Here is the output after starting it:

"UI:           http://localhost:4455/app/"
"Minio:        http:// =:7053"
"MailHog:      http://localhost:4436/"

I am running it on Windows.

I am unsure which logs could be relevant, but this is the Docker output:

refinery-refinery-gateway-1 | ValueError: Invalid endpoint: http:// =:7053
refinery-refinery-gateway-1 | ERROR:graphql.execution.utils:Traceback (most recent call last):
refinery-refinery-gateway-1 |   File "/usr/local/lib/python3.7/site-packages/graphql/execution/executor.py", line 452, in resolve_or_error
refinery-refinery-gateway-1 |     return executor.execute(resolve_fn, source, info, **args)
refinery-refinery-gateway-1 |   File "/usr/local/lib/python3.7/site-packages/graphql/execution/executors/sync.py", line 16, in execute
refinery-refinery-gateway-1 |     return fn(*args, **kwargs)
refinery-refinery-gateway-1 |   File "./graphql_api/query/transfer.py", line 65, in resolve_upload_credentials_and_id
refinery-refinery-gateway-1 |     project_id, user.id, file_name, file_type, file_import_options
refinery-refinery-gateway-1 |   File "./controller/transfer/manager.py", line 43, in get_upload_credentials_and_id
refinery-refinery-gateway-1 |     return s3.get_upload_credentials_and_id(org_id, project_id + "/" + str(task.id))
refinery-refinery-gateway-1 |   File "./submodules/s3/controller.py", line 367, in get_upload_credentials_and_id
refinery-refinery-gateway-1 |     response = minio.get_upload_credentials_and_id(target_bucket)
refinery-refinery-gateway-1 |   File "./submodules/s3/connections/minio.py", line 172, in get_upload_credentials_and_id
refinery-refinery-gateway-1 |     use_ssl=False,
refinery-refinery-gateway-1 |   File "/usr/local/lib/python3.7/site-packages/boto3/__init__.py", line 93, in client
refinery-refinery-gateway-1 |     return _get_default_session().client(*args, **kwargs)
refinery-refinery-gateway-1 |   File "/usr/local/lib/python3.7/site-packages/boto3/session.py", line 275, in client
refinery-refinery-gateway-1 |     aws_session_token=aws_session_token, config=config)
refinery-refinery-gateway-1 |   File "/usr/local/lib/python3.7/site-packages/botocore/session.py", line 874, in create_client
refinery-refinery-gateway-1 |     client_config=config, api_version=api_version)
refinery-refinery-gateway-1 |   File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 93, in create_client
refinery-refinery-gateway-1 |     verify, credentials, scoped_config, client_config, endpoint_bridge)
refinery-refinery-gateway-1 |   File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 362, in _get_client_args
refinery-refinery-gateway-1 |     verify, credentials, scoped_config, client_config, endpoint_bridge)
refinery-refinery-gateway-1 |   File "/usr/local/lib/python3.7/site-packages/botocore/args.py", line 108, in get_client_args
refinery-refinery-gateway-1 |     proxies_config=new_config.proxies_config)
refinery-refinery-gateway-1 |   File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 335, in create_endpoint
refinery-refinery-gateway-1 |     raise ValueError("Invalid endpoint: %s" % endpoint_url)
refinery-refinery-gateway-1 | graphql.error.located_error.GraphQLLocatedError: Invalid endpoint: http:// =:7053

I have tried installing it via pip and directly from the repo. I have also tried different browsers.

Best,
Leo

Add the possibility to label in data browser

Is your feature request related to a problem? Please describe.
When testing around or scrolling through the data browser, being able to add a label right then and there would help the workflow. Jumping into the labeling session and back without saving the filter as a slice is pretty much impossible, and even with a saved slice, the current position needs to be found/scrolled to again.

Describe the solution you’d like
Being able to assign a label in the data browser (possibly only for a subset, e.g. full-record classification tasks).

Describe alternatives you’ve considered
The current workflow

Additional context
Requested by GeorgePearse on Discord

Visualize embeddings in 2 or 3 dimensions

Is your feature request related to a problem? Please describe.
Visualize data to get a better overview of "missed" spots or clusters of instances that a model got wrong.

Describe the solution you’d like
Dimensionality reduction for easy visualization of data points.

Describe alternatives you’ve considered
-

Additional context
Requested by GeorgePearse on Discord

@jens @jhoetter I think the core value of visualizing low-dimensional data is to see whether there are any clusters/classes you've completely missed so far, and if so, how large they are. That is hard to tell from the current UI design.

After the embedding step you could just have a "select dimensionality reduction" option with PCA, t-SNE, and UMAP as the dimensionality reduction methods (UMAP worked best for me in the past).

It also helps with active learning if you can see a cluster of instances that the model gets wrong.

[BUG] - `cli.py` not known with `pip install kern-refinery`

Describe the bug
The CLI can't be used due to a module error.

To Reproduce
Steps to reproduce the behavior:

  1. pip install kern-refinery==1.0.1
  2. refinery start
  3. See error

Expected behavior
Run the server

Error

Traceback (most recent call last):
  File "/Users/jhoetter/opt/anaconda3/bin/refinery", line 5, in <module>
    from cli import main
ModuleNotFoundError: No module named 'cli'

Desktop (please complete the following information):

  • OS: macOS Monterey 12.1
  • Browser: Chrome

Monitor running tasks

Is your feature request related to a problem? Please describe.
There are a lot of background tasks running, e.g. embedding creation, tokenization or zero-shot. It can become difficult to keep track of running tasks.

Describe the solution you'd like
On the project overview page, provide insight into currently running tasks like tokenization, embedding creation, or heuristics.

Describe alternatives you've considered
A notification center writing updates, but the notifications can be overlooked (e.g. while you're getting a well-deserved cup of coffee) or become spammy.

Additional context
-

Online playground

Is your feature request related to a problem? Please describe.
To play around with the application, I want an option that doesn't require downloading the application or signing up for a test version.

Describe the solution you'd like
I'd like to log in with a given demo user (e.g. user demo and password demo), which gives me read-only (or similar) access to the application online.

Describe alternatives you've considered
-

Additional context
Similar to the demo version of bytebase

Prompt engineering zero-shot

Is your feature request related to a problem? Please describe.
I have valuable context information I could provide to my zero-shot model (such as label descriptions or a few manually labeled reference records).

Describe the solution you'd like
Enable users to modify the hypothesis template and to embed labeled data into the context for few-shot models.
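
For classification, the Hugging Face pipeline already exposes this via its hypothesis_template argument; roughly (model and labels here are illustrative):

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "refinery treats training data like a software artifact",
    candidate_labels=["software", "cooking", "sports"],
    hypothesis_template="This text is about {}.",  # the part users would edit
)
print(result["labels"][0], result["scores"][0])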

Describe alternatives you've considered
-

Additional context
see https://www.kern.ai/post/how-to-engineer-prompts-for-zero-shot-modules for more information about prompt engineering

Customize spaCy tokenizers

Is your feature request related to a problem? Please describe.
The spaCy tokenizers sometimes produce wrong tokens, e.g. for HTML data, tweets, or domain-specific terms.
For instance, for 'refinery is #opensource' I might want ['refinery', 'is', '#opensource'], but I get ['refinery', 'is', '#', 'opensource'].

Describe the solution you'd like
spaCy allows users to customize the tokenizer, as shown e.g. in this Stack Overflow thread: https://stackoverflow.com/questions/51012476/spacy-custom-tokenizer-to-include-only-hyphen-words-as-tokens-using-infix-regex

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # split only on this reduced set of infix characters; everything else
    # (e.g. hyphenated or #-prefixed terms) stays together as one token
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_sm')  # spacy.load('en') is deprecated in spaCy v3
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp('Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
print([token.text for token in doc])

We should allow users to customize their tokenizers, similar to our other programmatic interfaces.

Describe alternatives you've considered
NLTK offers a wider set of tokenizers (https://www.nltk.org/api/nltk.tokenize.html), e.g. also for tweets. But I strongly believe we should stick to one tokenizer solution for now, which is spaCy.

Additional context
-

Scatter graph UI based on some dimensionality reduced embeddings

Is your feature request related to a problem? Please describe.
Getting a visual representation of embeddings can help to cluster data/get a better overview.

Describe the solution you’d like
A way to visualize the current embedding data, preferably on a dimensionality-reduced set, to get a better overview.
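
A minimal sketch of the reduction step (scikit-learn PCA shown; t-SNE or UMAP would slot in the same way, and the embeddings here are random placeholders):

import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(1000, 768)  # placeholder for stored record embeddings
points_2d = PCA(n_components=2).fit_transform(embeddings)
# points_2d can now feed a scatter plot, e.g. colored by label or confidence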

Describe alternatives you’ve considered
-

Additional context
Requested by GeorgePearse on Discord
(e.g. with bulk)

Workflow for resolving manual labeling conflicts

Is your feature request related to a problem? Please describe.
Annotators often disagree, creating conflicts in gold-label resolution ("which one is the ground truth?").

Describe the solution you'd like
Similar to merge conflicts in GitHub, show differences and easily click through them

Describe alternatives you've considered
Filter by clicking on the inter-annotator agreement matrix to jump to the disagreement slice.

Additional context
-

Add a check for the PyPI version

Is your feature request related to a problem? Please describe.
Software gets outdated. How do I ensure that I always have the latest PyPI version installed?

Describe the solution you'd like
When using the library's start command, I'd like to have a version check and a prompt that lets me update then and there.
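
A rough sketch of such a check against the PyPI JSON API (the package name follows the install instructions; the actual prompt handling is omitted):

import requests
from importlib.metadata import version

def newer_version_available(package="kern-refinery"):
    installed = version(package)
    latest = requests.get(
        f"https://pypi.org/pypi/{package}/json", timeout=5
    ).json()["info"]["version"]
    # a differing version string suggests an update is available
    return installed != latest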

Describe alternatives you've considered
Always using the update command before I run any command.

Extend functionality for zero shot models

Is your feature request related to a problem? Please describe.
I want to use a zero-shot model (joeddav/xlm-roberta-large-xnli - link) that needs the protobuf library.

Describe the solution you'd like
-

Describe alternatives you've considered
Other usable models with multilingual or Spanish support.

Additional context
Reported by xavialex on Discord

Zero-shot log:
ImportError: 
XLMRobertaConverter requires the protobuf library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
that match your environment.

INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
--- Running on CPU. If you're facing performance issues, you should consider switching to a CUDA device
INFO:     172.21.0.21:43244 - "GET /recommend HTTP/1.1" 200 OK
INFO:     172.21.0.21:43908 - "GET /recommend HTTP/1.1" 200 OK
INFO:     172.21.0.21:44016 - "POST /zero-shot/sample-records HTTP/1.1" 200 OK
INFO:     172.21.0.21:44074 - "POST /zero-shot/sample-records HTTP/1.1" 500 Internal Server Error
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
Some weights of the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
ERROR:    Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 366, in run_asgi
  result = await app(self.scope, self.receive, self.send)
File "/usr/local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
  return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/fastapi/applications.py", line 261, in __call__
  await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette/applications.py", line 119, in __call__
  await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 181, in __call__
  raise exc
File "/usr/local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 159, in __call__
  await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/site-packages/starlette/exceptions.py", line 87, in __call__
  raise exc
File "/usr/local/lib/python3.10/site-packages/starlette/exceptions.py", line 76, in __call__
  await self.app(scope, receive, sender)
File "/usr/local/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
  raise e
File "/usr/local/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
  await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 659, in __call__
  await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 259, in handle
  await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/site-packages/starlette/routing.py", line 61, in app
  response = await func(request)
File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 227, in app
  raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/site-packages/fastapi/routing.py", line 162, in run_endpoint_function
  return await run_in_threadpool(dependant.call, **values)
File "/usr/local/lib/python3.10/site-packages/starlette/concurrency.py", line 45, in run_in_threadpool
  return await anyio.to_thread.run_sync(func, *args)
File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 28, in run_sync
  return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
  return await future
File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 754, in run
  result = context.run(func, *args)
File "/program/./app.py", line 45, in zero_shot_text
  return_values = util.get_zero_shot_10_records(
File "/program/./util/util.py", line 152, in get_zero_shot_10_records
  result = get_zero_shot_labels(
File "/program/./util/util.py", line 120, in get_zero_shot_labels
  result = get_labels_for_text(
File "/program/./model_integration/controller.py", line 16, in get_labels_for_text
  return generic.get_labels_for_text(
File "/program/./model_integration/models/generic.py", line 20, in get_labels_for_text
  classifier = __get_classifier_with_web_socket_update(
File "/program/./model_integration/models/generic.py", line 63, in __get_classifier_with_web_socket_update
  classifier = __get_classifier(config)
File "/program/./model_integration/models/generic.py", line 46, in __get_classifier
  __classifier[config] = pipeline("zero-shot-classification", model=config)
File "/usr/local/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 598, in pipeline
  tokenizer = AutoTokenizer.from_pretrained(
File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 546, in from_pretrained
  return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1780, in from_pretrained
  return cls._from_pretrained(
File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1915, in _from_pretrained
  tokenizer = cls(*init_inputs, **init_kwargs)
File "/usr/local/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 139, in __init__
  super().__init__(
File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 112, in __init__
  fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
File "/usr/local/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1033, in convert_slow_tokenizer
  return converter_class(transformer_tokenizer).converted()
File "/usr/local/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 421, in __init__
  requires_backends(self, "protobuf")
File "/usr/local/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 761, in requires_backends
  raise ImportError("".join(failed))
ImportError: 
XLMRobertaConverter requires the protobuf library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
that match your environment.

--- Running on CPU. If you're facing performance issues, you should consider switching to a CUDA device
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)

Upload personally created embeddings

Is your feature request related to a problem? Please describe.
Embeddings can only be created through the app and with Hugging Face models. There is no way to upload or integrate already existing, precomputed embeddings.

Describe the solution you’d like
I want to upload personally created embeddings (e.g. via an upload option similar to the record upload). They should be usable in the app, and there should be a visualization for them (e.g. with koaning/bulk; link in Additional Context).

Describe alternatives you’ve considered
-

Additional context
Requested by GeorgePearse on Discord

One last thought before I call it a day. I know the variation in dimensionality is what you stated was the problem with an upload-embeddings functionality, but I actually only want to upload 2D 'embeddings', e.g. the output of UMAP, so that they can actually be usefully visualized, in the same way that koaning/bulk and https://github.com/phurwicz/hover allow you to. This covers quite a lot of use cases (admittedly, 2D would not be so good for 'get similar' with Qdrant, but it is great for a quick summary; they may just be two completely different features).

In this space (super quick visualization and labelling) there are a few tools, but none are set up neatly enough to actually manage a project. And as for the production-grade tools (yourselves, rubrix, and a few others), none of you seem to have this feature, so it might be a nice way to distinguish yourselves a little.

The demoability of a 2D scatter plot (with meaningful embeddings) to senior management is 10/10 when you're trying to argue that your team should adopt a tool, or in my case that you should use NLP at all: https://projector.tensorflow.org/

It actually had some interesting ideas: if you go to "custom" on the bottom left, you can create axes of similarity to different examples. They just got some of the levels of abstraction wrong, which makes it a real pain to work with. It also doesn't work for text of any meaningful size.
