
spark-ocr-workshop's Introduction

John Snow Labs: State-of-the-art NLP in Python

The John Snow Labs library provides a simple & unified Python API for delivering enterprise-grade natural language processing solutions:

  1. 15,000+ free NLP models in 250+ languages in one line of code. Production-grade, scalable, trainable, and 100% open-source.
  2. Open-source libraries for Responsible AI (NLP Test), Explainable AI (NLP Display), and No-Code AI (NLP Lab).
  3. 1,000+ healthcare NLP models and 1,000+ legal & finance NLP models with a John Snow Labs license subscription.

Homepage: https://www.johnsnowlabs.com/

Docs & Demos: https://nlp.johnsnowlabs.com/

Features

Powered by John Snow Labs Enterprise-Grade Ecosystem:

  • 🚀 Spark-NLP : State of the art NLP at scale!
  • 🤖 NLU : 1 line of code to conquer NLP!
  • 🕶 Visual NLP : Empower your NLP with a set of eyes!
  • 💊 Healthcare NLP : Heal the world with NLP!
  • ⚖ Legal NLP : Bring justice with NLP!
  • 💲 Finance NLP : Understand Financial Markets with NLP!
  • 🎨 NLP-Display : Visualize and Explain NLP!
  • 📊 NLP-Test : Deliver Reliable, Safe and Effective Models!
  • 🔬 NLP-Lab : No-Code Tool to Annotate & Train new Models!

Installation

! pip install johnsnowlabs

from johnsnowlabs import nlp
nlp.load('emotion').predict('Wow that was easy!')

See the documentation for more details.

Usage

These are examples of getting things done with one line of code. See the General Concepts Documentation for building custom pipelines.

# Example of Named Entity Recognition
nlp.load('ner').predict("Dr. John Snow is a British physician born in 1813")

Returns :

entities    entities_class    entities_confidence
John Snow   PERSON            0.9746
British     NORP              0.9928
1813        DATE              0.5841

# Example of Question Answering
nlp.load('answer_question').predict("What is the capital of France")

Returns :

text                             answer
What is the capital of France    Paris

# Example of Sentiment classification
nlp.load('sentiment').predict("Well this was easy!")

Returns :

text                   sentiment_class    sentiment_confidence
Well this was easy!    pos                0.999901

nlp.load('ner').viz('Bill goes to New York')

Returns:
ner_viz_opensource

For a full overview see the 1-liners Reference and the Workshop.

Use Licensed Products

To use John Snow Labs' paid products such as Healthcare NLP, Visual NLP, Legal NLP, or Finance NLP, get a license key and then call nlp.install() to install them:

! pip install johnsnowlabs
# Install paid libraries via a browser login to connect to your account
from johnsnowlabs import nlp
nlp.install()
# Start a licensed session
nlp.start()
nlp.load('en.med_ner.oncology_wip').predict("Woman is on chemotherapy, carboplatin 300 mg/m2.")

Usage

# Visualize entity resolution to ICD-10-CM codes
nlp.load('en.resolve.icd10cm.augmented')\
    .viz('Patient with history of prior tobacco use, nausea, nose bleeding and chronic renal insufficiency.')

returns:
ner_viz_opensource

# Temporal Relationship Extraction & Visualization
nlp.load('relation.temporal_events')\
    .viz('The patient developed cancer after a mercury poisoning in 1999 ')

returns: relationv_viz

Helpful Resources

Take a look at the official John Snow Labs page, https://nlp.johnsnowlabs.com, for user documentation and examples.

  • General Concepts : General concepts in the johnsnowlabs library
  • Overview of 1-liners : Most commonly used models and their results
  • Overview of 1-liners for healthcare : Most commonly used healthcare models and their results
  • Overview of all 1-liner Notebooks : 100+ tutorials on how to use the 1-liners on text datasets for various problems and from various sources such as Twitter, Chinese News, Crypto News Headlines, Airline Traffic communication, and Product review classifier training
  • Connect with us on Slack : Problems, questions or suggestions? We have a very active and helpful community of over 2,000 AI enthusiasts putting John Snow Labs products to good use
  • Discussion Forum : Want a more in-depth discussion with the community? Post a thread in our discussion forum
  • Github Issues : Report a bug
  • Custom Installation : Custom installations, Air-Gap mode and other alternatives
  • The nlp.load(<Model>) function : Load any model or pipeline in one line of code
  • The nlp.load(<Model>).predict(data) function : Predict on Strings, Lists of Strings, Numpy Arrays, Pandas, Modin and Spark Dataframes
  • The nlp.load(<train.Model>).fit(data) function : Train a text classifier for 2-Class, N-Classes, Multi-N-Classes, Named Entity Recognition or Part-of-Speech Tagging
  • The nlp.load(<Model>).viz(data) function : Visualize the results of Word Embedding Similarity Matrix, Named Entity Recognizers, Dependency Trees & Parts of Speech, Entity Resolution, Entity Linking or Entity Status Assertion
  • The nlp.load(<Model>).viz_streamlit(data) function : Display an interactive GUI which lets you explore and test every model and feature in the John Snow Labs 1-liner repertoire in 1 click

License

This library is licensed under the Apache 2.0 license. John Snow Labs' paid products are subject to this End User License Agreement.
By calling nlp.install() to add them to your environment, you agree to its terms and conditions.

spark-ocr-workshop's People

Contributors

achilah, albertoandreottiatgmail, aymanechilah, c-k-loan, chicoq, diatrambitas, fadi212, gokhanturer, kolia1985, mary-sci, mellahysf, meryem1425, mozanunal, sihaama, xyutech

spark-ocr-workshop's Issues

OpenCV binaries fail to load

This started happening intermittently on different platforms (Databricks and Colab). It seems we depend on the OpenCV binary being present on the platform as a native library in order to load OpenCV.
We should also be able to fall back on the binary shipped within our jar, which we should probably get from the following artifact (it is on the classpath):

opencv-4.3.0-1.5.3-linux-x86.jar

However, we apparently don't ship it (maybe an assembly issue?).

Error
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:431)
at org.apache.spark.api.python.PythonRDD$.$anonfun$toLocalIteratorAndServe$2(PythonRDD.scala:327)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1626)
at org.apache.spark.api.python.PythonRDD$.$anonfun$toLocalIteratorAndServe$1(PythonRDD.scala:345)
at org.apache.spark.api.python.PythonRDD$.$anonfun$toLocalIteratorAndServe$1$adapted(PythonRDD.scala:296)
at org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:115)
at org.apache.spark.security.SocketFuncServer.handleConnection(SocketAuthServer.scala:108)
at org.apache.spark.security.SocketAuthServer$$anon$1.$anonfun$run$1(SocketAuthServer.scala:62)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:62)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 1310, 10.164.248.13, executor 22): java.lang.UnsatisfiedLinkError: no opencv_java in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1875)
at java.lang.Runtime.loadLibrary0(Runtime.java:872)
at java.lang.System.loadLibrary(System.java:1124)
at org.opencv.osgi.OpenCVNativeLoader.init(OpenCVNativeLoader.java:15)
at org.dcm4che3.opencv.StreamSegment.(StreamSegment.java:76)
at org.dcm4che3.opencv.NativeJPEGImageReaderSpi.canDecodeInput(NativeJPEGImageReaderSpi.java:92)
at javax.imageio.ImageIO$CanDecodeInputFilter.filter(ImageIO.java:567)
at javax.imageio.spi.FilterIterator.advance(ServiceRegistry.java:834)
at javax.imageio.spi.FilterIterator.next(ServiceRegistry.java:852)
at javax.imageio.ImageIO$ImageReaderIterator.next(ImageIO.java:528)
at javax.imageio.ImageIO$ImageReaderIterator.next(ImageIO.java:513)
at com.johnsnowlabs.ocr.transformers.BinaryToImage.$anonfun$transformUDF$2(BinaryToImage.scala:30)
at scala.util.Try$.apply(Try.scala:213)
at com.johnsnowlabs.ocr.transformers.BinaryToImage.$anonfun$transformUDF$1(BinaryToImage.scala:26)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.generate_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:733)
at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$1.hasNext(InMemoryRelation.scala:132)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
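
One way to test the "assembly issue" hypothesis from this report is to look inside the assembled jar and see whether any OpenCV native library is actually packaged. The following is a minimal sketch, not part of Spark OCR: the function name `list_native_libs`, the suffix list, and the `"opencv"` name filter are assumptions made for illustration, and the check only covers plain suffixes such as `.so` (not versioned names such as `.so.4`).

```python
import zipfile

# File suffixes that usually indicate a native (JNI) library inside a jar.
# This list is an assumption for illustration, not an exhaustive rule.
NATIVE_SUFFIXES = (".so", ".dll", ".dylib", ".jnilib")


def list_native_libs(jar_path, name_filter="opencv"):
    """Return the jar entries that look like native libraries matching name_filter."""
    # A jar is a zip archive, so the standard zipfile module can read it.
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name
            for name in jar.namelist()
            if name.lower().endswith(NATIVE_SUFFIXES)
            and name_filter.lower() in name.lower()
        )
```

An empty result for the assembly jar would be consistent with the native binary not being shipped; a non-empty result would suggest the loader is looking in java.library.path rather than in the jar.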

Spark OCR

This is regarding an error we are facing when invoking the table detection model from Spark OCR. It looks like a known error, but we didn't find a concrete solution in the existing issues.
It probably has to do with version compatibility: we tried Spark OCR 3.8 as suggested but ended up with the same error. Could you advise further?

binary_to_image.setImageType(ImageType.TYPE_3BYTE_BGR)
table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr").setInputCol("image").setOutputCol("region")

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.ocr.OcrPythonResourceDownloader.getDownloadSize.
: java.lang.NoClassDefFoundError: Could not initialize class com.johnsnowlabs.ocr.OcrPythonResourceDownloader$
at com.johnsnowlabs.ocr.OcrPythonResourceDownloader.getDownloadSize(OcrPythonResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Further details in the attached -
Model_loading_issue.docx
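
Since this report points at version compatibility, it can help to attach the exact installed versions to the issue. The sketch below uses only the Python standard library; the function name `installed_versions` is hypothetical, and the distribution names `pyspark`, `spark-nlp` and `spark-ocr` are assumptions about how the packages were installed with pip. It reports Python-side package versions only, not the jars loaded into the JVM.

```python
from importlib import metadata


def installed_versions(distributions=("pyspark", "spark-nlp", "spark-ocr")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this Python environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```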

Issue with Converting Custom Pipeline to PretrainedPipeline

I've created custom pipelines that use DicomToImage as well as PdfToImage. The pipelines work as expected when run normally. However, I encounter an error when I attempt to convert a custom pipeline into a PretrainedPipeline. The error message I receive is: 'JavaPackage' object is not callable.

I'm looking for guidance on how to resolve this issue. Any help would be greatly appreciated. Thank you.

The error can be reproduced in this notebook:
https://colab.research.google.com/drive/1a_4VJXHvgDBsfD83Xg9u_6syy3wge8Vm?usp=sharing

AnalysisException: Path does not exist: file:/data/signature/*

imagePath = "./data/signature/*"
image_df = spark.read.format("binaryFile").load(imagePath)

I am facing a path issue while running the code in Google Colab. I currently have the 30-day trial version.

Spark version: 3.0.2
Spark NLP version: 3.4.4
Spark OCR version: 3.14.0
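
A relative pattern such as "./data/signature/*" depends on where the notebook process is running, so a quick local check before calling spark.read can show whether any files match. This is a minimal sketch under the assumption that the files live on the driver's local filesystem; the helper name `resolve_input_glob` is hypothetical and not part of Spark or Spark OCR.

```python
import glob
import os


def resolve_input_glob(pattern):
    """Expand a glob relative to the current directory and return absolute matches."""
    # Resolve the pattern against the current working directory.
    absolute_pattern = os.path.abspath(pattern)
    matches = sorted(glob.glob(absolute_pattern))
    if not matches:
        raise FileNotFoundError(
            f"No files match {absolute_pattern!r} (cwd is {os.getcwd()!r})"
        )
    return absolute_pattern, matches
```

If the call succeeds, the returned absolute pattern can be passed to spark.read.format("binaryFile").load(...) in place of the relative one; if it raises, the message shows the directory the pattern was resolved against.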
