johnsnowlabs / spark-ocr-workshop
Public runnable examples of using John Snow Labs' OCR for Apache Spark.
Missing notebook for this demo: https://demo.johnsnowlabs.com/ocr/PDF_CHART_TO_TEXT/
I've created a custom pipeline that uses DicomToImage as well as PdfToImage. The pipeline works as expected when run normally. However, when I attempt to convert this custom pipeline into a PretrainedPipeline, I receive the error message: 'JavaPackage' object is not callable.
I’m looking for guidance on how to resolve this issue. Any help would be greatly appreciated. Thank you.
The error can be reproduced in this notebook:
https://colab.research.google.com/drive/1a_4VJXHvgDBsfD83Xg9u_6syy3wge8Vm?usp=sharing
imagePath = "./data/signature/*"
image_df = spark.read.format("binaryFile").load(imagePath)
I am facing a path issue while running the code above in Google Colab. I am currently on the 30-day trial version.
Spark version: 3.0.2
Spark NLP version: 3.4.4
Spark OCR version: 3.14.0
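A quick way to narrow down a path problem like this is to check, outside Spark, what the relative glob actually resolves to in the notebook's working directory. The sketch below uses only the Python standard library; the helper name `resolve_input_glob` is hypothetical and not part of Spark OCR, and it assumes the files live on the local filesystem of the Colab VM.

```python
import glob
import os


def resolve_input_glob(pattern):
    """Return (absolute_pattern, matched_files) for a local glob pattern.

    Hypothetical helper: it only inspects the local filesystem and does
    not call Spark at all.
    """
    absolute_pattern = os.path.abspath(pattern)
    # Keep only regular files; directories matched by the glob are skipped.
    matches = sorted(p for p in glob.glob(absolute_pattern) if os.path.isfile(p))
    return absolute_pattern, matches


if __name__ == "__main__":
    pattern, files = resolve_input_glob("./data/signature/*")
    print("Working directory:", os.getcwd())
    print("Resolved pattern: ", pattern)
    print("Matched files:    ", len(files))
    for path in files[:5]:
        print("  ", path)
```

If no files are matched, the working directory is probably not the one the relative path assumes; passing the printed absolute pattern to `load` is one thing to try.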
I received this error with:
Spark version: 3.0.2
Spark NLP version: 3.0.1
Spark OCR version: 3.8.0
This started to happen randomly on different platforms (Databricks and Colab). It seems we depend on the OpenCV binary being present on the platform (as a native library) in order to load OpenCV.
We should instead also depend on the binary shipped within our jar, which we should probably get from this jar (it is on the classpath):
opencv-4.3.0-1.5.3-linux-x86.jar
but which we apparently don't ship (maybe an assembly issue?).
Error
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:431)
at org.apache.spark.api.python.PythonRDD$.$anonfun$toLocalIteratorAndServe$2(PythonRDD.scala:327)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1626)
at org.apache.spark.api.python.PythonRDD$.$anonfun$toLocalIteratorAndServe$1(PythonRDD.scala:345)
at org.apache.spark.api.python.PythonRDD$.$anonfun$toLocalIteratorAndServe$1$adapted(PythonRDD.scala:296)
at org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:115)
at org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:108)
at org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$1(SocketAuthServer.scala:62)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:62)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 1310, 10.164.248.13, executor 22): java.lang.UnsatisfiedLinkError: no opencv_java in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1875)
at java.lang.Runtime.loadLibrary0(Runtime.java:872)
at java.lang.System.loadLibrary(System.java:1124)
at org.opencv.osgi.OpenCVNativeLoader.init(OpenCVNativeLoader.java:15)
at org.dcm4che3.opencv.StreamSegment.(StreamSegment.java:76)
at org.dcm4che3.opencv.NativeJPEGImageReaderSpi.canDecodeInput(NativeJPEGImageReaderSpi.java:92)
at javax.imageio.ImageIO$CanDecodeInputFilter.filter(ImageIO.java:567)
at javax.imageio.spi.FilterIterator.advance(ServiceRegistry.java:834)
at javax.imageio.spi.FilterIterator.next(ServiceRegistry.java:852)
at javax.imageio.ImageIO$ImageReaderIterator.next(ImageIO.java:528)
at javax.imageio.ImageIO$ImageReaderIterator.next(ImageIO.java:513)
at com.johnsnowlabs.ocr.transformers.BinaryToImage.$anonfun$transformUDF$2(BinaryToImage.scala:30)
at scala.util.Try$.apply(Try.scala:213)
at com.johnsnowlabs.ocr.transformers.BinaryToImage.$anonfun$transformUDF$1(BinaryToImage.scala:26)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.generate_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733)
at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$1.hasNext(InMemoryRelation.scala:132)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
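One way to test the packaging hypothesis above is to look inside the assembled jar and see whether any OpenCV native library is actually bundled. The following is a minimal sketch using only the standard library; the function name `find_native_libs`, the jar path, and the suffix list are assumptions made for illustration, not anything taken from the Spark OCR build.

```python
import os
import zipfile

# File suffixes that usually indicate a native library (assumed list).
NATIVE_SUFFIXES = (".so", ".dll", ".dylib", ".jnilib")


def find_native_libs(jar_path, keyword="opencv"):
    """List jar entries that look like native libraries matching keyword."""
    with zipfile.ZipFile(jar_path) as jar:
        found = []
        for name in jar.namelist():
            lower = name.lower()
            # Also accept versioned shared objects such as libfoo.so.4.3
            is_native = lower.endswith(NATIVE_SUFFIXES) or ".so." in lower
            if keyword in lower and is_native:
                found.append(name)
        return sorted(found)


if __name__ == "__main__":
    jar_path = "spark-ocr-assembly.jar"  # hypothetical path, adjust locally
    if os.path.exists(jar_path):
        libs = find_native_libs(jar_path)
        print(libs if libs else "no OpenCV native library found in jar")
    else:
        print("jar not found:", jar_path)
```

An empty result would be consistent with the `UnsatisfiedLinkError` in the trace, since the loader would then have to find `opencv_java` on the platform's `java.library.path`.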
I tried everything I could think of, but I was not able to use sparkocr in my Colab notebook.
I am not able to access https://pypi.johnsnowlabs.com/
It returns:
403 Forbidden
Code: AccessDenied
Message: Access Denied
RequestId: QGHN8VPV3 (truncated)
HostId: 3mdkTlnGi3YWyNhj (truncated)
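A 403 on the bare index URL may simply mean the root is not browsable. To my understanding, the workshop notebooks install the package by passing the license secret as a path segment of pip's `--extra-index-url`. The sketch below only builds that command string; the helper names, the `SPARK_OCR_SECRET` environment variable, and the example version are assumptions, so check them against the license instructions you received.

```python
import os
from urllib.parse import quote

BASE_INDEX = "https://pypi.johnsnowlabs.com"


def index_url(secret):
    """Build the per-license index URL (assumed layout: base + '/' + secret)."""
    if not secret:
        raise ValueError("a license secret is required")
    return f"{BASE_INDEX}/{quote(secret, safe='')}"


def pip_command(version, secret):
    """Return the pip invocation as a string; nothing is executed here."""
    return (
        f"pip install spark-ocr=={version} "
        f"--extra-index-url={index_url(secret)} --upgrade"
    )


if __name__ == "__main__":
    # SPARK_OCR_SECRET is a hypothetical variable name for this sketch.
    secret = os.environ.get("SPARK_OCR_SECRET", "")
    if secret:
        print(pip_command("3.14.0", secret))
    else:
        print("Set SPARK_OCR_SECRET to your license secret first.")
```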
Create the ChartToText notebook, powered by the open-source DePlot model plus an LLM (Llama 2)
This is regarding an error we are facing while invoking the table-detection model from Spark OCR. It looks like a known error, but we didn't find a concrete solution in the existing issues.
It is probably a version-compatibility problem: we tried Spark OCR 3.8 as suggested but ended up with the same error. Could you advise further?
binary_to_image.setImageType(ImageType.TYPE_3BYTE_BGR)
table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr").setInputCol("image").setOutputCol("region")
Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.ocr.OcrPythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: Could not initialize class com.johnsnowlabs.ocr.OcrPythonResourceDownloader$
at com.johnsnowlabs.ocr.OcrPythonResourceDownloader.getDownloadSize(OcrPythonResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Further details are in the attached file: Model_loading_issue.docx
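Since a version mismatch is suspected here, it can help to attach the exact installed Python package versions to a report like this one. The sketch below is one way to collect them with the standard library; the distribution names `pyspark`, `spark-nlp`, and `spark-ocr` are assumed to match what pip installed, and the function name is hypothetical.

```python
from importlib import metadata

# Distribution names as assumed for pip; adjust if your install differs.
PACKAGES = ("pyspark", "spark-nlp", "spark-ocr")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this Python environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

This reports only the Python-side packages; the jar versions loaded into the JVM can differ and would need to be checked separately.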