ChemSpot is a named entity recognition tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas and IUPAC entities. Since the different classes of relevant entities have rather different naming characteristics, ChemSpot uses a hybrid approach combining a Conditional Random Field with a dictionary. ChemSpot is released under the Common Public License 1.0.
This is a question rather than a bug report, and I am just looking for suggestions: if we want to annotate, for example, 1000 text files, how can the process be automated?
java -Xmx16G -jar chemspot.jar -t sample.txt -o predict.txt
The above line tags one file but I want to tag 1000 files and I can't do it manually. What is the alternative to this?
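One possible approach, sketched below, is a small driver that runs the same command once per file in a directory. It reuses only the `-t` and `-o` flags shown above; the class name `BatchTagger`, the directory arguments, and the `.predict.txt` output naming are assumptions made for illustration, not part of ChemSpot.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of ChemSpot: tags every *.txt file in a directory.
final class BatchTagger {

    // Builds the same command line as the single-file example for one input/output pair.
    static List<String> commandFor(Path jar, Path input, Path output) {
        return Arrays.asList(
                "java", "-Xmx16G", "-jar", jar.toString(),
                "-t", input.toString(),
                "-o", output.toString());
    }

    // Maps in/sample.txt to out/sample.predict.txt (the naming scheme is an assumption).
    static Path outputFor(Path input, Path outDir) {
        String name = input.getFileName().toString();
        String base = name.endsWith(".txt") ? name.substring(0, name.length() - 4) : name;
        return outDir.resolve(base + ".predict.txt");
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 3) {
            System.err.println("usage: BatchTagger <chemspot.jar> <input-dir> <output-dir>");
            return;
        }
        Path jar = Paths.get(args[0]);
        Path inDir = Paths.get(args[1]);
        Path outDir = Paths.get(args[2]);
        Files.createDirectories(outDir);

        // Collect the input files first so the directory handle is closed before tagging starts.
        List<Path> inputs = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inDir, "*.txt")) {
            for (Path p : stream) {
                inputs.add(p);
            }
        }

        for (Path input : inputs) {
            Process process = new ProcessBuilder(commandFor(jar, input, outputFor(input, outDir)))
                    .inheritIO()
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                System.err.println("Tagging failed for " + input + " (exit code " + exitCode + ")");
            }
        }
    }
}
```

Note that this starts a fresh JVM for every file, so the model loading cost is paid each time. For large batches, a long-running process that keeps the tagger in memory is likely to be much faster.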
Certain parameter combinations for the ChemSpot main application produce strange and somewhat arbitrary behavior. This should be changed so that all parameters work as expected.
Find a graceful way to deal with more abstract entities such as "protein" or "molecule" in the CRAFT corpus, without just "removing unwanted entities" that ChemSpot doesn't find (and wasn't designed to find).
Hi,
FYI, I've quickly developed a wrapper around ChemSpot that offers it as a REST service. This way you don't need 16 GB of memory every time you need to tag a new document.
Hey, I can guess at what several of the columns in the output are, but have no idea about a number of them. Is there a data dictionary somewhere? E.g., col1 = 'this'; col2 = 'that', etc.? I couldn't find one after a non-trivial search.
This is slow because a JCas is initialized each time we want to tag a string. Instead, hold on to one pre-initialized JCas and reset it each time this method gets called.
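The pattern described here can be sketched independently of UIMA. The snippet below is a minimal illustration, assuming the expensive object offers some reset operation (UIMA's JCas has a `reset()` method); the class `ReusableResource` is hypothetical, and a `StringBuilder` stands in for the JCas so the example runs without any dependencies.

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical holder: creates an expensive object once and resets it before each reuse.
final class ReusableResource<T> {
    private final Supplier<T> factory;
    private final Consumer<T> reset;
    private T instance; // created lazily on first use

    ReusableResource(Supplier<T> factory, Consumer<T> reset) {
        this.factory = factory;
        this.reset = reset;
    }

    // Runs the given work on the shared instance; synchronized because the instance is not shared safely otherwise.
    synchronized <R> R use(Function<T, R> work) {
        if (instance == null) {
            instance = factory.get();   // pay the initialization cost only once
        } else {
            reset.accept(instance);     // clear state left over from the previous call
        }
        return work.apply(instance);
    }

    // Small demonstration with a StringBuilder standing in for a JCas.
    static String demo() {
        int[] created = {0};
        ReusableResource<StringBuilder> holder = new ReusableResource<>(
                () -> { created[0]++; return new StringBuilder(); },
                sb -> sb.setLength(0));
        String first = holder.use(sb -> sb.append("aspirin").toString());
        String second = holder.use(sb -> sb.append("ethanol").toString());
        return first + " " + second + " created=" + created[0];
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "aspirin ethanol created=1"
    }
}
```

With a real JCas, the factory would perform the one-time creation and the reset callback would call `reset()` before the new document text is set.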
Exception in thread "main" java.io.IOException: There are no corpora defined.
at de.berlin.hu.chemspot.App.promptForCorpus(App.java:146)
at de.berlin.hu.chemspot.App.main(App.java:270)
Open the 'ChemSpot/chemspot-2.0/chemspot.jar' file using an archive manager (I used Archive Manager on Ubuntu) and add the 'logging.properties' file to the following location:
When calling ChemSpot from within a different Java project, the dictionary fails to load. Perhaps this problem is related to #22?
Failed initializing ChemSpot.
org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class "de.berlin.hu.uima.ae.tagger.brics.BricsTagger" failed. (Descriptor: jar:file:/media/Data/workspaces/wbi/prototype/lib/chemspot.jar!/desc/ae/tagger/BricsTaggerAE.xml)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:254)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:158)
at org.uimafit.factory.AnalysisEngineFactory.createPrimitive(AnalysisEngineFactory.java:403)
at de.berlin.hu.chemspot.ChemSpot.<init>(ChemSpot.java:118)
at ChemSpotRunner.main(ChemSpotRunner.java:10)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: org.apache.uima.resource.ResourceInitializationException
at de.berlin.hu.uima.ae.tagger.brics.BricsTagger.initialize(BricsTagger.java:59)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:252)
... 9 more
Caused by: java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:214)
at java.util.zip.ZipFile.<init>(ZipFile.java:144)
at java.util.zip.ZipFile.<init>(ZipFile.java:115)
at de.berlin.hu.uima.ae.tagger.brics.BricsMatcher.<init>(BricsMatcher.java:42)
at de.berlin.hu.uima.ae.tagger.brics.BricsTagger.initialize(BricsTagger.java:55)
... 10 more
ChemSpot can be installed via Maven, but it would also be nice if the build automatically created a runnable jar, copied all required files, and optionally tarred/compressed them.
It seems that some resources are not properly loaded from within the jar, namely:
resources/chebi/chebi_ontology_fulldepth.txt
resources/prefixes.txt
resources/phare.txt
resources/suffixes-filtered.txt
Use this.getClass().getClassLoader().getResource(PATH TO FILE) to access them.
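A minimal sketch of that suggestion follows. It uses `getResourceAsStream`, the stream-returning sibling of `getResource`, so that the files can be read whether they sit in a directory or inside the jar. The class name `ClasspathResources` and the UTF-8 encoding are assumptions; the resource paths are the ones listed above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads text resources through the class loader instead of the file system.
final class ClasspathResources {

    // True if the resource can be found on the classpath (paths have no leading slash).
    static boolean exists(String path) {
        return ClasspathResources.class.getClassLoader().getResource(path) != null;
    }

    // Reads all lines of a classpath resource; fails with a clear message if it is missing.
    static List<String> readLines(String path) throws IOException {
        InputStream in = ClasspathResources.class.getClassLoader().getResourceAsStream(path);
        if (in == null) {
            throw new IOException("Resource not found on classpath: " + path);
        }
        // UTF-8 is an assumption about how the resource files are encoded.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        }
    }

    public static void main(String[] args) {
        String[] paths = {
            "resources/chebi/chebi_ontology_fulldepth.txt",
            "resources/prefixes.txt",
            "resources/phare.txt",
            "resources/suffixes-filtered.txt"
        };
        for (String path : paths) {
            System.out.println(path + " -> " + (exists(path) ? "found" : "missing"));
        }
    }
}
```

Code that opens such files with `java.io.File` works only when the resources are unpacked on disk, which would explain why loading breaks once everything is packaged in the jar.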
ChemSpot would be wonderful if we could use it with Scala 2.11.8 in our NLP pipeline. However, we ran into the following error: class file for scala.ScalaObject not found.
The reason is that de.berlin.hu.enumed.EntityTagger uses scala.ScalaObject, and we cannot change compiled code.
Is there any chance of getting the source code of the Maven dependency below?
groupId: eumed
artifactId: eumed-rg
version: 1.0.0
If we can get the source code, we could update the ChemSpot code and use it with state-of-the-art libraries.
A lot of false positives are produced by short terms like "IOP", "BMP", "CIA" or "SAM". Find a way to deal with these matches properly (and maybe separately, in a new component?).
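One way such a separate component could look is sketched below: it splits mentions into those that are kept and those that are flagged because they are short and appear on a list of known ambiguous abbreviations. This is only an illustration; the class `ShortTermFilter`, the length threshold, and the idea of a hand-maintained list are assumptions, not an existing ChemSpot feature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical post-processing step: separates short, ambiguous abbreviations from other mentions.
final class ShortTermFilter {
    private final Set<String> ambiguous;
    private final int maxLength;

    ShortTermFilter(Set<String> ambiguous, int maxLength) {
        this.ambiguous = ambiguous;
        this.maxLength = maxLength;
    }

    // A mention is suspicious if it is short and on the list of ambiguous abbreviations.
    boolean isSuspicious(String mention) {
        return mention.length() <= maxLength && ambiguous.contains(mention);
    }

    // Mentions that pass the filter unchanged.
    List<String> kept(List<String> mentions) {
        List<String> result = new ArrayList<>();
        for (String m : mentions) {
            if (!isSuspicious(m)) {
                result.add(m);
            }
        }
        return result;
    }

    // Mentions set aside for separate handling, e.g. context-based disambiguation.
    List<String> flagged(List<String> mentions) {
        List<String> result = new ArrayList<>();
        for (String m : mentions) {
            if (isSuspicious(m)) {
                result.add(m);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ShortTermFilter filter = new ShortTermFilter(
                new HashSet<>(Arrays.asList("IOP", "BMP", "CIA", "SAM")), 3);
        List<String> mentions = Arrays.asList("IOP", "aspirin", "SAM", "NaCl");
        System.out.println("kept: " + filter.kept(mentions));       // prints "kept: [aspirin, NaCl]"
        System.out.println("flagged: " + filter.flagged(mentions)); // prints "flagged: [IOP, SAM]"
    }
}
```

Flagging rather than silently dropping keeps the door open for treating these terms separately, since some of them are genuine chemical mentions in the right context.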
Add an option to load ChemSpot settings from a configuration file. In particular, it would be nice to also have parameters for turning certain components (e.g. SumTagger, CRF) on or off.
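As a rough sketch of what this could look like, the snippet below reads component switches from a standard Java properties file. The class `ChemSpotSettings` and the keys `component.crf.enabled` and `component.sumtagger.enabled` are hypothetical names chosen for illustration; components default to enabled when a key is absent.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical settings holder; key names are assumptions, not existing ChemSpot options.
final class ChemSpotSettings {
    final boolean crfEnabled;
    final boolean sumTaggerEnabled;

    private ChemSpotSettings(boolean crfEnabled, boolean sumTaggerEnabled) {
        this.crfEnabled = crfEnabled;
        this.sumTaggerEnabled = sumTaggerEnabled;
    }

    // Loads settings from any reader, e.g. a FileReader on a configuration file.
    static ChemSpotSettings load(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return new ChemSpotSettings(
                flag(props, "component.crf.enabled"),
                flag(props, "component.sumtagger.enabled"));
    }

    // Convenience wrapper for in-memory configuration text.
    static ChemSpotSettings parse(String text) {
        try {
            return load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Missing keys default to "true" so that an empty file leaves every component on.
    private static boolean flag(Properties props, String key) {
        return Boolean.parseBoolean(props.getProperty(key, "true").trim());
    }

    public static void main(String[] args) {
        ChemSpotSettings settings = parse("component.crf.enabled=false\n");
        System.out.println("CRF enabled: " + settings.crfEnabled);             // prints "CRF enabled: false"
        System.out.println("SumTagger enabled: " + settings.sumTaggerEnabled); // prints "SumTagger enabled: true"
    }
}
```

A properties file keeps the format dependency-free; a richer format would only be worth it if the settings become nested.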