Labelbox enables teams to maximize the value of their unstructured data with its enterprise-grade training data platform. For ML use cases, Labelbox has tools to deploy labelers to annotate data at massive scale, diagnose model performance to prioritize labeling, and plug in existing ML models to speed up labeling. For non-ML use cases, Labelbox has a powerful catalog with auto-computed similarity scores that users can leverage to label large amounts of data with a couple clicks.
This library was designed to run in a Databricks environment, although it will function in any Spark environment with some modification.
We strongly encourage collaboration - please feel free to fork this repo and tweak the code base to work for your own data, and make pull requests if you have suggestions on how to enhance the overall experience, add new features, or improve general performance.
Please report any issues/bugs via GitHub Issues.
- Databricks: Runtime 10.4 LTS or Later
- Apache Spark: 3.1.2 or Later
- A Labelbox account
- A Labelbox API key
Set up LabelSpark with the following lines of code:

```python
%pip install labelspark -q
```

```python
import labelspark as ls

api_key = ""  # Insert your Labelbox API key here
client = ls.Client(api_key)
```
Once set up, you can run the following core functions:

- `client.create_data_rows_from_table()`: Creates Labelbox data rows (and metadata) given a Spark DataFrame
- `client.export_to_table()`: Exports labels (and metadata) from a given Labelbox project and creates a Spark DataFrame
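A minimal sketch of how these two calls might fit together in a Databricks notebook. The column names (`row_data`, `global_key`) and keyword arguments shown in the comments are assumptions based on typical Labelbox usage, not the library's confirmed signatures - consult the demo notebooks below for the exact parameters.

```python
def build_asset_rows(urls):
    """Build a list of dicts suitable for spark.createDataFrame().

    Uses the row_data / global_key column names commonly expected for
    Labelbox data-row uploads (assumed here, not verified).
    """
    return [{"row_data": url, "global_key": url} for url in urls]


# In a Databricks notebook with a Labelbox API key (illustrative only;
# parameter names are assumptions):
#
#   import labelspark as ls
#   client = ls.Client(api_key)
#
#   df = spark.createDataFrame(build_asset_rows([
#       "https://example.com/image-1.jpg",
#       "https://example.com/image-2.jpg",
#   ]))
#   client.create_data_rows_from_table(df)        # upload rows to Labelbox
#   export_df = client.export_to_table(project)   # pull labels back as a DataFrame
```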
| Notebook | Github |
|---|---|
| Basics: Data Rows from URLs | |
| Data Rows with Metadata | |
| Data Rows with Attachments | |
| Data Rows with Annotations | |
| Putting it all Together | |
| Notebook | Github |
|---|---|
| Exporting Data to a Spark Table | |
While using LabelSpark, you will likely also use the Labelbox SDK (e.g. for programmatic ontology creation). These resources will help familiarize you with the Labelbox Python SDK:
- Visit our docs to learn how the SDK works
- Check out our notebook examples to follow along with interactive tutorials
- View the Labelbox API reference.
To enhance the software supply chain security of Labelbox's users, as of 0.7.4, every release contains a SLSA Level 3 Provenance document.
This document provides detailed information about the build process, including the repository and branch from which the package was generated.
By using the SLSA framework's official verifier, you can verify the provenance document to ensure that the package is from a trusted source. Verifying the provenance helps confirm that the package has not been tampered with and was built in a secure environment.
Example usage for the 0.7.4 release wheel:

```shell
VERSION=0.7.4 # release tag
gh release download ${VERSION} --repo Labelbox/labelspark
slsa-verifier verify-artifact \
  --source-branch master \
  --builder-id 'https://github.com/slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@refs/tags/v2.0.0' \
  --source-uri "git+https://github.com/Labelbox/labelspark" \
  --provenance-path multiple.intoto.jsonl \
  ./labelspark-${VERSION}-py3-none-any.whl
```