labelbox / labelspark Goto Github PK

This library makes it easy to take unstructured data in your Data Lake and prepare it for analysis and AI work in Databricks. The Labelbox Connector for Apache Spark takes in a Spark DataFrame to create a dataset in Labelbox, and it also brings labeled, structured data back into Databricks also as a Spark DataFrame.

License: Apache License 2.0

Python 2.36% HTML 77.01% Jupyter Notebook 20.62%

labelspark's Introduction

The Official Labelbox <> Databricks Python Integration

Labelbox enables teams to maximize the value of their unstructured data with its enterprise-grade training data platform. For ML use cases, Labelbox has tools to deploy labelers to annotate data at massive scale, diagnose model performance to prioritize labeling, and plug in existing ML models to speed up labeling. For non-ML use cases, Labelbox has a powerful catalog with auto-computed similarity scores that users can leverage to label large amounts of data with a couple clicks.

This library was designed to run in a Databricks environment, although it will function in any Spark environment with some modification.

We strongly encourage collaboration - please free to fork this repo and tweak the code base to work for you own data, and make pull requests if you have suggestions on how to enhance the overall experience, add new features, or improve general performance.

Please report any issues/bugs via Github Issues.

Requirements
Setup
Example Notebooks

Requirements

Databricks: Runtime 10.4 LTS or Later
Apache Spark: 3.1.2 or Later
Labelbox account
Generate a Labelbox API key

Setup

Set up LabelSpark with the following lines of code:

%pip install labelspark -q
import labelspark as ls

api_key = "" # Insert your Labelbox API key here
client = ls.Client(api_key)

Once set up, you can run the following core functions:

client.create_data_rows_from_table() : Creates Labelbox data rows (and metadata) given a Spark Table DataFrame
client.export_to_table() : Exports labels (and metadata) from a given Labelbox project and creates a Spark DataFrame

Example Notebooks

Importing Data

Notebook	Github
Basics: Data Rows from URLs
Data Rows with Metadata
Data Rows with Attachments
Data Rows with Annotations
Putting it all Together

Exporting Data

Notebook	Github
Exporting Data to a Spark Table

While using LabelSpark, you will likely also use the Labelbox SDK (e.g. for programmatic ontology creation). These resources will help familiarize you with the Labelbox Python SDK:

Visit our docs to learn how the SDK works
Checkout our notebook examples to follow along with interactive tutorials
View the Labelbox API reference.

Provenance

To enhance the software supply chain security of Labelbox's users, as of 0.7.4, every release contains a SLSA Level 3 Provenance document.
This document provides detailed information about the build process, including the repository and branch from which the package was generated.

By using the SLSA framework's official verifier, you can verify the provenance document to ensure that the package is from a trusted source. Verifying the provenance helps confirm that the package has not been tampered with and was built in a secure environment.

Example of usage for the 0.7.4 release wheel:

VERSION=0.7.4 #tag
gh release download 0.7.4 --repo Labelbox/labelspark

slsa-verifier verify-artifact --source-branch master --builder-id 'https://github.com/slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@refs/tags/v2.0.0' --source-uri "git+https://github.com/Labelbox/labelspark" --provenance-path multiple.intoto.jsonl ./labelspark-${VERSION}-py3-none-any.whl

labelspark's People

Contributors

Stargazers

Watchers

Forkers

laopeng2021 cristobalmitchell dey-abhishek nickaustinlee hhands44 merlinssrvyml

labelspark's Issues

New Bug: 0.5.2

labelspark/labelspark/bronze_to_silver.py

Line 27 in 27ead6b

if needs_koalas:

variable needs_koalas doesn't exist in bronze_to_silver since you commented out the variable creation part!

Requirements prevent library imoport on DB 7.3+

The setup.py requirements prevent the library from running on Azure Databricks 7.3+

The requirements in setup.py include pyspark, databricks and koalas. These packages are installed by default on Databricks. When running %pip install labelspark with the current setup it will install a different version of pyspark, databricks and koalas from Pypi. The package called databricks on Pypi is not the correct package and will overwrite the proper package on Databricks. This will cause import labelspark to fail with the error:

ModuleNotFoundError: No module named 'databricks.koalas'

A simple fix is to remove pyspark, databricks and koalas from the requirements.

Bug in Silver Table Code

I tried to convert a bronze table to silver where the Label.classification.answer column (from flattened bronze table) looked like this:

[[{"featureId": "redacted_for_confidentiality", "schemaId": "redacted_for_confidentiality", "title": "Response", "value": "response"}], [{"featureId": "redacted_for_confidentiality", "schemaId": "redacted_for_confidentiality", "title": "Response 2", "value": "response"}], [{"featureId": "redacted_for_confidentiality", "schemaId": "redacted_for_confidentiality", "title": "Response 3", "value": "response"}], [{"featureId": "redacted_for_confidentiality", "schemaId": "redacted_for_confidentiality", "title": "Response 4", "value": "response"}]]

As you can see, it is an array of arrays.

It seemed that the add_json_answers_to_dictionary() method failed to process the flattened label column properly. The method iterated through the array of Label.classifications.answer but the ast literal parsing seemed to get stuck after that. Also I think that since the method was designed to parse JSON strings, it was failing to process the Row.

This modification seemed to fix the method (simply tacked the code on at the end), but further testing is necessary.

#this runs if the literal stuff didn't run and it's not json 
    if isinstance(answer, list) and len(answer) == 1: 
      answer = answer[0] #puzzled how we get a list of length 1 where the contents is the row 
      my_dictionary[title] = answer["title"]
    else: 
      my_dictionary[title] = answer

I also believe we need to revamp add_json_answers_to_dictionary() to make it easier to debug, as it consists of some relatively brittle code built around edge cases encountered during the many different JSON scenarios we get back from the Labelbox API due to nesting and such.

Planned Upgrade: Optional flag to output SQL Compliant Column Names

Prior versions of the Labelbox Connector for Databricks tried to preserve column names to how they were expressed in the JSON output of Labelbox. For instance "Labeled Data" was expressed as a column named "Labeled Data".

Downstream workflows sometimes require accessing these columns in ways where a space in the name is impractical. Additionally, spaces need to be removed prior to saving the table as a Delta Lake table. Right now developers can run a simple column reformat to solve these issues.

To make it easier for developers downstream but avoid breaking existing code which may reference column names with spaces, we are exploring the addition of a flag "SQL_friendly_columns" which will output dataframes with the following characteristics:

All spaces will be replaced with underscores in column names
The dot format which we currently use to express nesting will be replaced with underscores.
All character cases will be preserved to match Labelbox JSON character case

Examples:

"Labeled Data" --> "Labeled_Data"
"Label.objects.title" --> "Label_objects_title"

Project Export Error

Hello, I'm getting errors when trying to export a project into Databricks.

I tried both Databricks 9.1 and 10.3 LTS, but they both give the same error.
lb_df = labelspark.get_annotations(client, "cl157rvtk0zxy0z7c1p7q7puy", spark, sc)

I managed to get other projects to export properly, but this particular project has about 50k annotated images and over 200k bbox annotations. I imported the annotations via Python SDK and confirmed annotating work as expected on the Labelbox UI but databricks export is not working for this project.

Error Message:

Py4JJavaError                             Traceback (most recent call last)
<[command-1481209477824276]()> in <module>
----> 1 lb_df = labelspark.get_annotations(client, "cl157rvtk0zxy0z7c1p7q7puy", spark, sc)

/databricks/python/lib/python3.8/site-packages/labelspark/__init__.py in get_annotations(client, project_id, spark, sc)
     49     with urllib.request.urlopen(project.export_labels()) as url:
     50         api_response_string = url.read().decode()  # this is a string of JSONs
---> 51     bronze_table = jsonToDataFrame(api_response_string, spark, sc)
     52     bronze_table = dataframe_schema_enrichment(bronze_table)
     53     return bronze_table

/databricks/python/lib/python3.8/site-packages/labelspark/__init__.py in jsonToDataFrame(json, spark, sc, schema)
    195     if schema:
    196         reader.schema(schema)
--> 197     return reader.json(sc.parallelize([json]))
    198 
    199 

/databricks/spark/python/pyspark/sql/readwriter.py in json(self, path, schema, primitivesAsString, prefersDecimal, allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, dateFormat, timestampFormat, multiLine, allowUnquotedControlChars, lineSep, samplingRatio, dropFieldIfAllNull, encoding, locale, pathGlobFilter, recursiveFileLookup, allowNonNumericNumbers, modifiedBefore, modifiedAfter)
    382             keyed._bypass_serializer = True
    383             jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
--> 384             return self._df(self._jreader.json(jrdd))
    385         else:
    386             raise TypeError("path can be only string, list or RDD")

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o1781.json.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 3.0 failed 4 times, most recent failure: Lost task 15.3 in stage 3.0 (TID 20) (10.48.85.196 executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2828)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2775)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2769)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2769)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1305)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1305)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1305)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3036)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2977)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2965)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1067)
	at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2477)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2460)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2572)
	at org.apache.spark.sql.catalyst.json.JsonInferSchema.infer(JsonInferSchema.scala:94)
	at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$.$anonfun$inferFromDataset$1(JsonDataSource.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:273)
	at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$.inferFromDataset(JsonDataSource.scala:110)
	at org.apache.spark.sql.DataFrameReader.$anonfun$json$1(DataFrameReader.scala:693)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:693)
	at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:673)
	at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:656)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)

Compatibility Issue w/ DBR 10 and Spark 3.2

With the recent Pandas on Spark update and Koalas deprecation on DBR 10, the current Connector code which utilizes the Koalas library will throw an error.

For a near-term workaround, we recommend using DBR LTS 9.1. In the meantime we will work on an update to preserve functionality for LTS and higher, while allowing for the transition to DBR 10. Once DBR 10+ becomes LTS, we will deprecate all reference to Koalas.

Here is the notice from Databricks (DBR 10 released Oct 2021):

Koalas is deprecated on clusters that run Databricks Runtime 10.0 and above. For clusters running Databricks Runtime 10.0 and above, use Pandas API on Spark instead. If you try using Koalas on clusters that run Databricks Runtime 10.0 and above, an informational message displays, recommending that you use Pandas API on Spark instead.

export throwing error related to global key and external id

I'm doing an export on a small project with a mix of masks, bounding boxes, lines, and polygons. There seems to be an error in export_and_flatten labels related to global key or external id.

ValidationError: 1 validation error for DataRow
root
Must set either id or global_key (type=value_error)

Here is the full stacktrace:

`---------------------------------------------------------------------------
ValidationError Traceback (most recent call last)
in
----> 1 df = client.export_to_table(
2 project=project_id,
3 include_performance=True, # Will include columns that pertain to labeling / review performance
4 include_agreement=True, # Will include agreement score column
5 include_metadata=True, # Will include columns that pertain to metadata

/databricks/python/lib/python3.8/site-packages/labelspark/client.py in export_to_table(self, project, include_metadata, include_performance, include_agreement, verbose, mask_method, divider)
45 spark = pyspark.sql.SparkSession.builder.appName('labelspark_export').getOrCreate()
46
---> 47 flattened_labels_dict = export_and_flatten_labels(
48 client=self.lb_client, project=project,
49 include_metadata=include_metadata, include_performance=include_performance, include_agreement=include_agreement,

/databricks/python/lib/python3.8/site-packages/labelbase/downloader.py in export_and_flatten_labels(client, project, include_metadata, include_performance, include_agreement, verbose, mask_method, divider)
57 "external_id" : label["External ID"]
58 }
---> 59 res = flatten_label(label_dict=label, ontology_index=ontology_index, schema_to_name_path=schema_to_name_path, mask_method=mask_method, divider=divider)
60 for key, val in res.items():
61 flat_label[f"annotation{divider}{str(key)}"] = val

/databricks/python/lib/python3.8/site-packages/labelbase/annotate.py in flatten_label(label_dict, ontology_index, schema_to_name_path, mask_method, divider)
115 annotation_value = [array, [255,255,255]]
116 else:
--> 117 png = mask_to_bytes(input=obj["instanceURI"], method="url", color=[255,255,255], output="png")
118 annotation_value = [png, "null"]
119 if "classifications" in obj.keys():

/databricks/python/lib/python3.8/site-packages/labelbase/masks.py in mask_to_bytes(input, method, color, output)
55 )
56 # Convert back into ndjson
---> 57 mask_png = list(NDJsonConverter.serialize([mask_label]))[0]["mask"]["png"]
58 return mask_png

/databricks/python/lib/python3.8/site-packages/labelbox/data/serialization/ndjson/converter.py in serialize(labels)
42 """
43
---> 44 for example in NDLabel.from_common(labels):
45 res = example.dict(by_alias=True)
46 for k, v in list(res.items()):

/databricks/python/lib/python3.8/site-packages/labelbox/data/serialization/ndjson/label.py in from_common(cls, data)
42 data: LabelCollection) -> Generator["NDLabel", None, None]:
43 for label in data:
---> 44 yield from cls._create_non_video_annotations(label)
45 yield from cls._create_video_annotations(label)
46

/databricks/python/lib/python3.8/site-packages/labelbox/data/serialization/ndjson/label.py in _create_non_video_annotations(cls, label)
205 yield NDClassification.from_common(annotation, label.data)
206 elif isinstance(annotation, ObjectAnnotation):
--> 207 yield NDObject.from_common(annotation, label.data)
208 elif isinstance(annotation, (ScalarMetric, ConfusionMatrixMetric)):
209 yield NDMetricAnnotation.from_common(annotation, label.data)

/databricks/python/lib/python3.8/site-packages/labelbox/data/serialization/ndjson/objects.py in from_common(cls, annotation, data)
612 if (annotation.confidence):
613 optional_kwargs['confidence'] = annotation.confidence
--> 614 return obj.from_common(annotation.value, subclasses, annotation.name,
615 annotation.feature_schema_id, annotation.extra,
616 data, **optional_kwargs)

/databricks/python/lib/python3.8/site-packages/labelbox/data/serialization/ndjson/objects.py in from_common(cls, mask, classifications, name, feature_schema_id, extra, data, confidence)
410
411 return cls(mask=lbv1_mask,
--> 412 data_row=DataRow(id=data.uid, global_key=data.global_key),
413 name=name,
414 schema_id=feature_schema_id,

/databricks/python/lib/python3.8/site-packages/pydantic/main.cpython-38-x86_64-linux-gnu.so in pydantic.main.BaseModel.init()

ValidationError: 1 validation error for DataRow
root
Must set either id or global_key (type=value_error)`