johnsnowlabs / spark-ocr-workshop

Public runnable examples of using John Snow Labs' OCR for Apache Spark.

Jupyter Notebook 99.99% Scala 0.01% Shell 0.01% Python 0.01% Dockerfile 0.01%

spark-ocr-workshop's Issues

Create the notebook for PDF_CHART_TO_TEXT

Missing notebook for this demo: https://demo.johnsnowlabs.com/ocr/PDF_CHART_TO_TEXT/

Issue with Converting Custom Pipeline to PretrainedPipeline

I’ve created a custom pipeline that uses DicomToImage as well as PdfToImage. The pipeline works as expected when run normally. However, when I try to convert this custom pipeline into a PretrainedPipeline, it fails with the error 'JavaPackage' object is not callable.

I’m looking for guidance on how to resolve this issue. Any help would be greatly appreciated. Thank you.

The error can be reproduced in this notebook:
https://colab.research.google.com/drive/1a_4VJXHvgDBsfD83Xg9u_6syy3wge8Vm?usp=sharing

AnalysisException: Path does not exist: file:/data/signature/*

imagePath = "./data/signature/*"
image_df = spark.read.format("binaryFile").load(imagePath)

I am facing a path issue while running the code above in Google Colab. I am currently on the 30-day trial version.

Spark version: 3.0.2
Spark NLP version: 3.4.4
Spark OCR version: 3.14.0
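
One way to narrow this down is to check on the driver whether the glob matches anything before handing it to Spark. The sketch below is only an illustration: `resolve_local_glob` is a hypothetical helper, not part of Spark or Spark OCR, and it assumes the images live on the local filesystem of the Colab VM rather than on DBFS, HDFS, or S3. It expands the relative pattern against the notebook's working directory and fails early, reporting the resolved path, if nothing is found.

```python
import glob
import os


def resolve_local_glob(pattern):
    """Return (absolute_pattern, matching_files) for a local glob pattern.

    Hypothetical debugging helper: raises FileNotFoundError when the pattern
    matches nothing and includes the resolved path in the message.
    """
    # Resolve "./data/..." against the current working directory.
    absolute_pattern = os.path.abspath(pattern)
    matches = sorted(glob.glob(absolute_pattern))
    if not matches:
        raise FileNotFoundError(
            "No files match %r (working directory: %s)"
            % (absolute_pattern, os.getcwd())
        )
    return absolute_pattern, matches
```

If the helper succeeds, the absolute pattern can be passed to the reader from the report, for example `spark.read.format("binaryFile").load("file://" + absolute_pattern)`. If it raises, the data folder is probably not where the notebook expects it.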

Py4JError: com.johnsnowlabs.ocr.transformers.VisualDocumentClassifier does not exist in the JVM

I received this error with the following versions:

Spark version: 3.0.2
Spark NLP version: 3.0.1
Spark OCR version: 3.8.0
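
"Does not exist in the JVM" means Py4J could not find the class on the driver's classpath. Possible causes include the Spark OCR jar not being loaded into the session, or the loaded jar not containing the class that the Python package expects. A first check is to see which jars the session was started with. The sketch below is a hypothetical helper, not part of Spark OCR; the default substring `spark-ocr` is an assumption about how the jar file is named.

```python
def find_jars(spark_jars, needle="spark-ocr"):
    """Return entries of a comma-separated jar list whose file name contains `needle`.

    `spark_jars` is the raw value of the `spark.jars` setting. The default
    `needle` is an assumed naming convention for the Spark OCR jar.
    """
    entries = [entry.strip() for entry in spark_jars.split(",") if entry.strip()]
    # Compare against the file name only, not the directory part of the path.
    return [entry for entry in entries if needle in entry.rsplit("/", 1)[-1]]
```

In a live session the input could be read with `spark.sparkContext.getConf().get("spark.jars", "")`. An empty result suggests the OCR jar was never added to the session; a result with a version in the file name can be compared against the installed Python package.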

OpenCV binaries fail to load

This started happening intermittently on different platforms (Databricks and Colab). It seems we depend on the OpenCV binary being present on the platform as a native library in order to load OpenCV.
We should also be able to fall back on the binary shipped within our jar, which we should probably get from the following artifact (it is on the classpath):

opencv-4.3.0-1.5.3-linux-x86.jar

but apparently we do not ship it (maybe an assembly issue?).

Error
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:431)
at org.apache.spark.api.python.PythonRDD$.$anonfun$toLocalIteratorAndServe$2(PythonRDD.scala:327)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1626)
at org.apache.spark.api.python.PythonRDD$.$anonfun$toLocalIteratorAndServe$1(PythonRDD.scala:345)
at org.apache.spark.api.python.PythonRDD$.$anonfun$toLocalIteratorAndServe$1$adapted(PythonRDD.scala:296)
at org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:115)
at org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:108)
at org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$1(SocketAuthServer.scala:62)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:62)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 1310, 10.164.248.13, executor 22): java.lang.UnsatisfiedLinkError: no opencv_java in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1875)
at java.lang.Runtime.loadLibrary0(Runtime.java:872)
at java.lang.System.loadLibrary(System.java:1124)
at org.opencv.osgi.OpenCVNativeLoader.init(OpenCVNativeLoader.java:15)
at org.dcm4che3.opencv.StreamSegment.<clinit>(StreamSegment.java:76)
at org.dcm4che3.opencv.NativeJPEGImageReaderSpi.canDecodeInput(NativeJPEGImageReaderSpi.java:92)
at javax.imageio.ImageIO$CanDecodeInputFilter.filter(ImageIO.java:567)
at javax.imageio.spi.FilterIterator.advance(ServiceRegistry.java:834)
at javax.imageio.spi.FilterIterator.next(ServiceRegistry.java:852)
at javax.imageio.ImageIO$ImageReaderIterator.next(ImageIO.java:528)
at javax.imageio.ImageIO$ImageReaderIterator.next(ImageIO.java:513)
at com.johnsnowlabs.ocr.transformers.BinaryToImage.$anonfun$transformUDF$2(BinaryToImage.scala:30)
at scala.util.Try$.apply(Try.scala:213)
at com.johnsnowlabs.ocr.transformers.BinaryToImage.$anonfun$transformUDF$1(BinaryToImage.scala:26)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.generate_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733)
at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$1.hasNext(InMemoryRelation.scala:132)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
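
The `UnsatisfiedLinkError` above comes from `System.loadLibrary("opencv_java")`, which on Linux looks for a file named `libopencv_java.so` in the directories listed in `java.library.path`. To confirm whether a given node has that file, something like the following sketch could be used. `find_native_library` is a hypothetical helper and assumes a Linux host; it only inspects the filesystem and does not load anything.

```python
import os


def find_native_library(library_path, library_name="opencv_java"):
    """List files that System.loadLibrary(library_name) could pick up on Linux.

    `library_path` is a java.library.path-style string: directories separated
    by os.pathsep. On Linux the JVM maps "opencv_java" to "libopencv_java.so".
    """
    file_name = "lib%s.so" % library_name
    hits = []
    for directory in library_path.split(os.pathsep):
        candidate = os.path.join(directory, file_name)
        # Skip empty entries produced by leading or doubled separators.
        if directory and os.path.isfile(candidate):
            hits.append(candidate)
    return hits
```

On the driver, the path string could be obtained with `spark.sparkContext._jvm.System.getProperty("java.library.path")` (this relies on the private `_jvm` attribute). Since the failure in the trace happened on an executor, a driver-side check is only a partial answer; the same function would have to run on the workers to be conclusive.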

ModuleNotFoundError: No module named 'sparkocr'

I tried everything I could think of, but I was not able to use sparkocr in my Colab notebook.
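
When a package seems to be installed but the import still fails, it can help to ask the running interpreter directly whether it can see the module and where it would load it from. The sketch below uses only the standard library; `describe_module` is a hypothetical helper name. It does not install anything and does not depend on a license key.

```python
import importlib.util
import sys


def describe_module(name):
    """Report whether top-level module `name` is importable by this interpreter."""
    spec = importlib.util.find_spec(name)
    return {
        "module": name,
        "found": spec is not None,
        # Path the module would be loaded from, or None when it is missing.
        "location": getattr(spec, "origin", None),
        "interpreter": sys.executable,
    }
```

Calling `describe_module("sparkocr")` in the notebook shows whether the kernel can find the package. If `found` is false right after an install, the package may have gone into a different Python environment, or the runtime may need a restart before the new package is visible.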

The link to get key is broken.

I am not able to access this: https://pypi.johnsnowlabs.com/

Gives this:

403 Forbidden
Code: AccessDenied
Message: Access Denied
RequestId: QGHN8VPV3 (truncated)
HostId: 3mdkTlnGi3YWyNhj (truncated)

Create the notebook ChartToText powered by open source LLM

Create the ChartToText notebook powered by open source models: Deplot + an LLM (LLAMA2).

Spark OCR

This is regarding an error we get when invoking the table detection model from Spark OCR. It looks like a known error, but we did not find a concrete solution in the existing issues.
It probably has to do with version compatibility: we tried Spark OCR 3.8 as suggested but got the same error. Could you advise further?

binary_to_image.setImageType(ImageType.TYPE_3BYTE_BGR)
table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr").setInputCol("image").setOutputCol("region")

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.ocr.OcrPythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: Could not initialize class com.johnsnowlabs.ocr.OcrPythonResourceDownloader$
at com.johnsnowlabs.ocr.OcrPythonResourceDownloader.getDownloadSize(OcrPythonResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Further details are in the attached file: Model_loading_issue.docx
